ABSTRACT

A retrieval system is considered in which document
descriptions are stored and accessed in groups called clusters. All 
items in a cluster meet common similarity criteria and are 
represented by a composite entity called a profile. In large 
collections, profiles themselves are clustered and additional levels 
of profiles are generated. This entire process establishes a file 
organization for the system in that records are composed with a 
logical structure with a directory (profile hierarchy) to facilitate 
searching. Clustered files have the following advantages over other
organizations: complete document information is stored in the same
location; storage overhead is low; and flexible, economical searches
can be realized. The problems investigated in clustered file
organization are profile definition, updating, hierarchy storage,
and secondary profile uses. A comparison with an inverted file is
included. Nearly all work has an experimental base and uses the SMART 
retrieval system. The proposed organization compares favorably in 
terms of speed and storage economy. Various request-document matching
procedures and feedback schemes are easily implemented. Search
precision is lower, but this is compensated by a flexible level of
recall (low or high). (Author/SJ)
DOCUMENT RETRIEVAL BASED ON CLUSTERED FILES
Daniel McClure Murray, Ph.D.

Cornell University, 1972 

A retrieval system is considered in which document descriptions 
are stored and accessed in groups called clusters. All items in a 
cluster meet common similarity criteria and are represented by a 
composite entity called a profile. In large collections, profiles 
themselves are clustered and additional levels of profiles are generated. 
This entire process establishes a file organization for the system in
that records are composed into a logical structure with a directory
(profile hierarchy) to facilitate searching. Clustered files have
the following advantages over other organizations: complete document
information is stored in the same location, storage overhead is low,
and flexible and economical searches can be realized.

The problems investigated in clustered file organizations are: 
profile definition, updating, hierarchy storage, and secondary profile 
uses. A comparison with an inverted file is also included. Nearly
all work has an experimental base and uses the SMART retrieval system
or facilities built around it. In this report, the initial chapters 
cover concepts in document retrieval, file organization, and clustered 
files. Chapter IV describes the experimental environment and a new 
evaluation scheme for cluster searches based on precision floor and 
recall ceiling. Chapter V deals with the preparation of unbiased, 
economical profiles. Several types (standard, rank value, rank,
shortened) and weighting schemes (none, partial, full) are studied.

A reasonable profile can be constructed by using term weights based 
on frequency ranks and deleting a large percentage of low weight terms. 
Chapter VI indicates that profiles require only minor weight adjustments 
to incorporate new documents; however, some re-clustering should occur 
after 25%-50% growth. Chapter VII develops a model of a disk storage
algorithm and suggests storing the hierarchy by levels for most effi- 
cient access. Chapter VIII describes a scheme for query alteration 
during searches which uses term-term relationships in profiles. Chapter
IX indicates that a clustered file uses no more space than an inverted 
file and provides more flexible search criteria. Chapter X is a 
summary of findings. 

In total, this thesis attempts to answer the question "Is a
clustered file organization suitable for on-line document retrieval?"
The proposed organization compares favorably in terms of speed and 
storage economy; various request-document matching procedures, search 
strategies, and feedback schemes are easily implemented. Search preci- 
sion is less, but compensated by a flexible level of recall (low or 
high). Furthermore, arbitrary accesses for individual records are not 
required since those records with a high probability of satisfying a 
request are concentrated in a few disk locations. Therein lies the 
greatest value of a clustered file. 
Synopsis



This dissertation examines the file organization problem in an 
on-line, computerized document retrieval system. Its aim is to
demonstrate the utility of a clustered file in such an environment. The
clustered scheme uses a classification algorithm to partition a
document collection into overlapping sets of related items. The documents
in each cluster are stored contiguously and accessed through a hierarchy
of profile vectors which "summarize" cluster content. There are
several reasons why this approach is superior to conventional file 
organizations based on chains or inverted indices. First, all data for
each document are stored only once and in the same location. This 
allows the use of any query-document match function and the implementa- 
tion of advanced features such as relevance feedback searching and 
dynamic document space modification. Second, since there are no 
pointers or lists, the storage overhead promises to be low. Third, 
and finally, searches are economical and flexible since most clusters 
are eliminated during query-profile matching. Once a cluster is chosen 
for detailed examination, its contents are accessed rapidly since they 
reside in contiguous locations. Consequently, retrieval times and
costs do not increase linearly with either the number of documents
examined or the query length.
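
The two-stage search described above (match the query against profiles, then examine only the documents stored in the best clusters) can be sketched as follows. This is a minimal illustration and not the SMART implementation: the cosine correlation, the single-level hierarchy, the dictionary-based sparse vectors, and the names `cosine`, `cluster_search`, `n_expand`, and `n_return` are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine correlation between two sparse vectors stored as {term: weight}."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_search(query, clusters, n_expand=1, n_return=3):
    """Rank clusters by query-profile correlation, then rank only the
    documents stored in the n_expand best clusters.

    clusters: list of (profile, documents) pairs, where documents maps a
    document id to its vector. Overlapping clusters are allowed.
    """
    ranked = sorted(clusters, key=lambda c: cosine(query, c[0]), reverse=True)
    scores = {}
    for _profile, documents in ranked[:n_expand]:
        for doc_id, vector in documents.items():
            scores[doc_id] = cosine(query, vector)
    best = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _score in best[:n_return]]


# Hypothetical two-cluster file: (profile, {document id: document vector}).
clusters = [
    ({"wing": 2.0, "flutter": 1.0},
     {"d1": {"wing": 1.0, "flutter": 1.0}, "d2": {"wing": 1.0}}),
    ({"heat": 2.0, "transfer": 1.0},
     {"d3": {"heat": 1.0, "transfer": 1.0}}),
]
result = cluster_search({"flutter": 1.0, "wing": 1.0}, clusters)
```

In this toy file the second cluster is eliminated at the profile level, so document `d3` is never read. That is the source of the economy claimed for the organization.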

The experiments in this thesis are concerned with the profile
hierarchy: its construction, use, and maintenance. These topics are
important since without accurate, economical profiles, cluster
generation is much more costly and the file cannot be searched properly.

Chapter I contains an introduction to document retrieval, file 
organization, and the problems to be investigated. It is argued that 
an on-line retrieval system is required in order to overcome incomplete 
and inaccurate text analysis and to allow users to probe data bases, 
refine their requests, and control searches. Concerning
implementation, the choice of file organization is the primary factor in
determining how information is accessed and how the costs are distributed.
A clustered file is suggested as being flexible and powerful enough to
handle the diverse document search criteria and requirements for
information displays, while having favorable cost-performance
characteristics.

Chapter II surveys current file organizations, particularly the 
manner in which they partition the file and provide linkage among
related records. Five general schemes are compared: sequential, chained,
inverted, computer-access, and clustered. The results indicate that
only the inverted and clustered organizations provide the speed
necessary for on-line document searches. Both allow implementation
of relevance feedback and document space modification; however, these
options require additional storage space in inverted files. Generally,
clustered files support more flexible search strategies, but retrieve
with somewhat less precision. These tradeoffs are explored fully in
later sections. 
Chapter III contains a thorough discussion of clustered files
and background information for the experiments. Classification methods,
search strategies, and query clustering are considered briefly. The
main topics are the construction, use, and maintenance of profiles.
Three types of standard profiles are presented: P1, having no term
weights; P2, having weights based on document frequencies; and P3, having
weights based on total term frequencies. These are compared to Doyle's
rank value profiles, P y. When additions are made to the file,
maintenance is viewed in terms of small adjustments to keep profiles in
the "center" of their clusters. However, ultimately, the collection
must be re-classified; the frequency of this operation is to be
determined. Other important matters include limitations on profile
length and the order of storage (level, subtree, hier-filial). Both
of these help determine search times and storage overhead.

The experimental environment is described in Chapter IV,
including the characteristics of the SMART system, the Cranfield document
collection (1400 abstracts, 225 requests), and the three clustered
files. Two evaluation schemes are presented. One is the regular
SMART procedure based on precision and recall data for fixed search
strategies. The second is a new, economical method which is independent
of search strategy and accounts for system effort more accurately. It
is based on measurements of recall ceiling (percent of relevant that
are recoverable) and precision floor (percent of total recoverable that
are relevant) taken before each cluster is expanded. Naturally,
clusters are expanded in order of decreasing similarity with the request
vector. Both methods are used in Chapters V to IX.
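
The two measures just defined can be illustrated with a short sketch. The synopsis does not say exactly which documents count as "recoverable" at each measurement point. The code below assumes it is the union of the clusters selected so far, taken in order of decreasing query-profile similarity. That reading, and the names `ceiling_and_floor`, `ranked_clusters`, and `relevant`, are assumptions and not the procedure of Chapter IV.

```python
def ceiling_and_floor(ranked_clusters, relevant):
    """For each prefix of the ranked cluster list, return
    (recall_ceiling, precision_floor) in percent.

    ranked_clusters: list of sets of document ids, in order of
    decreasing query-profile similarity.
    relevant: set of document ids judged relevant to the request.
    Assumption: "recoverable" means the union of the clusters
    selected so far.
    """
    recoverable = set()
    results = []
    for cluster in ranked_clusters:
        recoverable |= cluster
        hits = len(recoverable & relevant)
        # Percent of relevant documents that are recoverable.
        recall_ceiling = 100.0 * hits / len(relevant) if relevant else 0.0
        # Percent of recoverable documents that are relevant.
        precision_floor = 100.0 * hits / len(recoverable) if recoverable else 0.0
        results.append((recall_ceiling, precision_floor))
    return results


# Hypothetical request with four relevant documents, one of which (9)
# lies outside both clusters and therefore caps the recall ceiling.
measurements = ceiling_and_floor(
    [{1, 2, 3, 4}, {4, 5, 6, 7}], relevant={1, 2, 5, 9}
)
```

Because the figures depend only on cluster membership, they can be computed without running any particular within-cluster search strategy, which is consistent with the independence from search strategy claimed for the method.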

Chapter V reports on an extensive set of experiments with profiles,
in particular, standard and rank value vectors, biased search results,
vector length, and frequency and weighting considerations. The best 
profiles are those with term weights based on frequency ranks and hence 
are a compromise of the standard (P2 or P3) and rank value types. The
use of ranks keeps the range of weights small and reduces correlation 
domination. Keeping the weight origin at a minimum eliminates bias 
and maintains maximum distinction among terms. The success with ranks 
indicates that the importance of terms does not increase linearly
with frequency, but in a more gradual way (approximately logarithmically).
This accounts for the success observed with profiles using categories
of weights and the partial success of unweighted profiles with deletion
of "noise" terms. The tests also indicate that a large portion (80%)
of low weighted terms can be eliminated with only small degradations
in search performance. High frequency significant terms cannot be
deleted, moved upward in the profile hierarchy, or assigned smaller
weights than less frequent terms. Overall, the results indicate methods
for constructing reasonable and economical profiles, but also show
that further improvements can be made.
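
A profile of the kind favored here (weights taken from frequency ranks, with most low weighted terms deleted) might be built along the following lines. This is a sketch under stated assumptions and not the construction used in Chapter V. It assumes total term frequencies are summed over the cluster, equal frequencies share a rank, and the lowest frequency receives weight 1. The names `rank_weighted_profile` and `keep_fraction` are hypothetical.

```python
from collections import Counter


def rank_weighted_profile(documents, keep_fraction=0.2):
    """Build a cluster profile whose weights are frequency ranks.

    documents: list of {term: frequency} vectors in one cluster.
    keep_fraction: share of the terms kept after deleting the lowest
    weighted ones (0.2 corresponds to deleting 80 percent).
    """
    totals = Counter()
    for doc in documents:
        totals.update(doc)  # adds the frequencies term by term
    # Distinct frequency values, lowest first; the lowest gets rank 1,
    # so the weight origin sits at the minimum.
    distinct = sorted(set(totals.values()))
    rank_of = {freq: rank for rank, freq in enumerate(distinct, start=1)}
    weighted = {term: rank_of[freq] for term, freq in totals.items()}
    # Keep the highest weighted terms; ties are broken alphabetically.
    n_keep = max(1, round(keep_fraction * len(weighted)))
    ordered = sorted(weighted.items(), key=lambda item: (-item[1], item[0]))
    return dict(ordered[:n_keep])


# Hypothetical cluster of two documents with five distinct terms.
docs = [
    {"wing": 5, "flutter": 2, "the": 1},
    {"wing": 3, "panel": 2, "of": 1},
]
profile = rank_weighted_profile(docs, keep_fraction=0.4)
```

In the example the total frequencies 8 and 2 become weights 3 and 2, which shows how ranks keep the range of weights small while preserving the order of the terms.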

The file maintenance experiments in Chapter VI are summarized in
two findings. First, if reasonable profiles are used such as those
described in Chapter V, then a clustered file may increase its size
25%-50% before the degradation in search performance is such that
re-classification is required. These percentages are calculated as the
ratio of additions to the current file size. Second, adjustments to
profiles during updating are of only slight benefit. The most
reasonable scheme is to adjust the weights of only existing terms and not to
introduce new terms (ALTER option). Since the adjusted profiles keep
the same size, they can overwrite their predecessors without
destroying any storage organization in the profile hierarchy.
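
The two maintenance findings can be made concrete with a small sketch. The synopsis states only that the ALTER option adjusts the weights of existing terms without adding new ones. The additive update below, the function names, and the use of 0.25 (the lower end of the reported range) as a trigger are assumptions. Whether the file size in the ratio is measured before or after the additions is likewise assumed here to be before.

```python
def alter_profile(profile, new_document):
    """Return an updated profile of the same size: weights of existing
    terms absorb the new document; unseen terms are ignored."""
    return {
        term: weight + new_document.get(term, 0)
        for term, weight in profile.items()
    }


def growth_ratio(additions, file_size):
    """Additions as a fraction of the current file size."""
    return additions / file_size


profile = {"wing": 3, "flutter": 2}
# The term "boundary" is new to the profile and is therefore dropped.
updated = alter_profile(profile, {"wing": 1, "boundary": 4})
# Hypothetical growth: 420 additions to a 1400-document file.
needs_reclassification = growth_ratio(additions=420, file_size=1400) >= 0.25
```

Since the set of terms never changes, the updated profile occupies the same space as the old one and can be written back in place.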

Chapter VII describes experiments involving the use of indexed
sequential access for managing a disk resident clustered file. For
forward search strategies, storing the hierarchy by levels is found to
provide the most rapid response. Furthermore, it is shown that cluster
searching can retrieve many of the relevant documents obtained in a
full search, but at much less cost. For example, cluster searches
requiring 10-16 disk accesses achieve about 70% of the precision and
recall values of a full search requiring 65 accesses.
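
Storage of the hierarchy by levels amounts to laying out the profiles in breadth-first order, so that all profiles on one level occupy adjacent positions. The sketch below illustrates only that ordering. It is not the disk storage model of Chapter VII, and the names `store_by_levels`, `children`, and `slot` are hypothetical.

```python
from collections import deque


def store_by_levels(root, children):
    """Return profile ids in storage order, one hierarchy level after
    another (breadth-first), plus the slot assigned to each profile.

    children: maps a profile id to the list of its child profile ids.
    """
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children.get(node, []))
    slot = {node: index for index, node in enumerate(order)}
    return order, slot


# Hypothetical three-level profile hierarchy.
hierarchy = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1"]}
order, slot = store_by_levels("root", hierarchy)
```

A forward search that compares the query with every profile on a level before descending then reads a contiguous run of slots, which is one plausible reason for the rapid response reported.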


The basic idea of the query alteration procedure in Chapter VIII
is to associate a thesaurus with each profile and to expand
selected request terms during searches. Each mini-thesaurus reflects
the peculiarities of the vocabulary in its cluster. Consequently,
there is a unique opportunity to combine the use of broad, general
term relationships on upper hierarchy levels and specific, local
relationships on lower levels. Unfortunately, for the options tested,
query alteration is of doubtful value when employed automatically.
However, it might be used profitably as part of a negative feedback
procedure.
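
Expansion of selected request terms through a profile's mini-thesaurus might look as follows. The synopsis does not give the expansion rule, so the down-weighting `factor`, the rule of never lowering an existing weight, and the names `expand_query` and `local_thesaurus` are assumptions made purely for illustration.

```python
def expand_query(query, thesaurus, expand_terms, factor=0.5):
    """Add thesaurus relatives of the selected request terms.

    query: {term: weight}; thesaurus: {term: [related terms]} attached
    to one profile; factor: assumed down-weighting of added terms.
    """
    expanded = dict(query)
    for term in expand_terms:
        if term not in query:
            continue
        for related in thesaurus.get(term, []):
            added = factor * query[term]
            # Never lower the weight of a term already in the query.
            expanded[related] = max(expanded.get(related, 0.0), added)
    return expanded


# Hypothetical thesaurus attached to one low-level cluster profile.
local_thesaurus = {"flutter": ["oscillation", "vibration"]}
expanded = expand_query(
    {"flutter": 2.0, "wing": 1.0}, local_thesaurus, ["flutter"]
)
```

A different thesaurus would be consulted at each profile visited, so the same request term could be expanded broadly near the top of the hierarchy and narrowly near the bottom.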

Chapter IX compares the storage requirements, speed, and
effectiveness of clustered and inverted files. The inverted organization
requires twice as much storage space as a clustered file in order to
provide equivalent retrieval services. For a specific number of disk
accesses, the inverted search retrieves a fixed number of documents
and generally achieves high precision at a specific recall level. With
the same effort, a cluster search provides many or few documents.
Although its precision is less, the search may have a recall level 
which is higher or lower depending on the number of retrieved documents. 
This flexibility within searches of a single cost figure is considered 
a genuine advantage. 

The final chapter summarizes the work. The clustered file 
organization is considered more advantageous than other file organiza- 
tions since it allows greater flexibility in matching and searching 
without increased retrieval time or storage costs. 
Chapter I

Introduction

1. Automated Information Systems

A. General System Types

Information management is of increasing concern in the modern world.
Commercial, scientific, educational, and governmental institutions produce 
and distribute such large quantities of reports, research data, and other 
literature that it is difficult to keep abreast of almost any field using 
manual processes. Automated information systems, which combine high-speed
computing equipment, mass storage devices, and sophisticated programming
systems, appear to be one way of containing this information explosion.

Hayes (1) distinguishes three types of systems based on their scope
of activities: data base, reference, and text processing. Data base
systems manipulate files of fixed-format records and generally provide
capabilities for adding and deleting records, changing the contents of
selected fields, and retrieving items with specified properties. A familiar
example is that of an airline reservation system in which passenger records 
contain the name, flight number, destination, time of departure, and so 
forth. In the course of a day's activity, many records are created, 
deleted, and changed in order to reflect current business conditions. 
Queries to the file are expressed as logical combinations of special key- 
words and commands which the system is designed to interpret. In this 
respect, the operations are oriented toward experienced personnel rather 
than a general public. Data base systems have wide use currently, and a 
number of generalized software packages are available to meet most

Reference systems deal with more complex data structures such as
printed text or pictures, and retrieve references to items rather than
actual articles or photographs. The information content of a text or
picture is contained in a set of manually assigned descriptors or keywords;
sometimes automatic dictionary procedures are used as indexing aids. The
same record processing facilities are available as in data base systems, 
although retrieval conditions may be relaxed from the strict criteria of 
a Boolean search formulation. In particular, a simple function of the 
number of matching request descriptors could determine which file items 
are closely related to a request. Subsequently only the highest scoring 
records are actually obtained for user inspection. Since highly structured 
queries are not needed, requestors should find reference systems easier 
to use than data base systems. The NASA, MEDLARS, and Chemical Abstracts
retrieval services are examples of reference systems dealing with
aeronautical, medical, and chemical literature.
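
The relaxed retrieval condition mentioned above, a simple function of the number of matching request descriptors, can be sketched as a coordinate match. The function name `coordinate_match`, the parameter `n_best`, the alphabetical tie-break, and the sample records are hypothetical and are not taken from any of the services named.

```python
def coordinate_match(request, records, n_best=2):
    """Score each record by the number of request descriptors it shares
    and return the ids of the n_best highest scoring records."""
    wanted = set(request)
    scores = {rec_id: len(wanted & set(desc)) for rec_id, desc in records.items()}
    # Highest score first; ties are broken by record id.
    ranked = sorted(scores, key=lambda rec_id: (-scores[rec_id], rec_id))
    # Records sharing no descriptor with the request are never returned.
    return [rec_id for rec_id in ranked[:n_best] if scores[rec_id] > 0]


# Hypothetical file of manually assigned descriptors.
records = {
    "r1": ["aerodynamics", "wing", "flutter"],
    "r2": ["medicine", "cardiology"],
    "r3": ["wing", "structures"],
}
top = coordinate_match(["wing", "flutter"], records)
```

Unlike a Boolean formulation, record `r3` is still retrieved although it satisfies only one of the two request descriptors; it is simply ranked below `r1`.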

Full text processing systems include automatic statistical, syntactic, 
and semantic procedures to format intricate data structures containing 
the implicit and explicit information in the original text. The search 
process is complex also, including correlation measures and linguistic 
processing, as well as man-machine interaction, data displays, and itera- 
tive searching. If complete text is stored, fact retrieval may be possible 
so that answers to questions are given rather than lists of references. 

At this time, information systems which approach full text processing
(SMART, SIR, and STUDENT) are still in their experimental stages.



B. Document Retrieval Systems

A document retrieval system is a combination of reference and text 
processing systems usually limited to simplified content analysis
(dictionary or thesaurus), but containing complex search negotiation procedures.
Syntactic and linguistic methods are often avoided for cost-effectiveness
reasons. As in reference systems, document identification numbers are
retrieved although citations, abstracts, or even full text could be 
printed if storage provisions allow. In the particular model used for 
this study, the subject content of a document is reflected in a set of 
weighted keywords derived automatically from the original text. Natural 
language queries are indexed in the same fashion and matched with file 
items using a correlation function. Those documents with the highest 
correlations have their accession numbers returned to the user. If more 
precise or complete information is desired, the search is continued using 
a feedback or other strategy. Additional information about the experi- 
mental system is given in Section 3 of this chapter and Section 1 of 
Chapter IV. 

The application of computer systems to document retrieval raises 
many intellectual and technical problems. Automated processing obviously 
substitutes software logic for the human intellect available in a good 
reference library. Regardless of their complexity, all programs operate 
without a genuine understanding of text and therefore are inaccurate by 
human standards. Undoubtedly, the overall goal of information retrieval 
is the development of methods which are equal to human ingenuity with 
respect to retrieving relevant documents; however, research aimed at
obtaining good procedures is complicated not only by the difficulty of text
analysis, but also by the nature of relevance itself. Since each user
is the judge of whether a document is relevant to his request, it is
difficult to find measurable properties that always discriminate among
documents properly. Moreover, users often do not have well defined 
information needs, but still require the system to retrieve specific 
data in answer to vague questions. Lastly, even if a query statement is 
precise, it may not correspond to the stored text in any reasonable way. 
To overcome these problems, a great many automated systems operate on an 
interactive basis. The system supplies tutorial instruction, information 
displays, file statistics, and iterative processing while users clarify
ambiguous terms, expand or refine queries, and identify relevant or non- 
relevant documents in feedback processes. This combination of computer 
hardware for rapid processing and user supplied semantic inputs forms an 
integral part of the system design considered here. 

The most important implementation matter associated with automatic
document retrieval systems is the selection of a file organization. In 
nearly all instances the size of the data collection requires the file to 
be structured so that only part of it is manipulated for any operation. 
While the previous problems deal with semantic and human factors, file 
organization is the primary technical issue, involving interactions 
among file characteristics (size, complexity), user satisfaction (response 
time), software complexity (search, interaction, maintenance), and
hardware (processors, storage media). This thesis examines the file problem
in an on-line document retrieval system with respect to querying the file
by subject content rather than bibliographic keys such as author, journal,
etc. Its aim is to demonstrate the feasibility and utility of a particular
organization, a clustered file, in this environment.

2. File Organization

A. General Purpose and Definition 

The purpose of a file organization is to provide convenient and 
efficient use of stored data. Convenience relates to the ease of retriev- 
ing records, implementing desired software, and maintaining the file. 

Other factors include the ability to extend to future applications and to
take advantage of natural structures within the data, patterns of usage,
and special features of operating systems. Measures of efficiency
generally pertain to memory space (amount of waste, overhead, and
redundant storage) and access time (overhead computations and I/O operations).
User satisfaction may or may not be an evaluation criterion because of its 
subjective nature and close relation to the convenience of providing 
software services. Document retrieval systems, however, often measure 
user satisfaction by the quality of the retrieved material. Lefkovitz (2)
draws the following distinction between file structure (organization) and 
information structure. File structure denotes the record layout, directory 
setups, and file partitions necessary to meet specifications for access 
times, storage economy, and maintenance effort. Information structure is 
an inherent property of the data that exists by design or by the way the 
data appear in the collection. This structure, whether natural or imposed, 
can be used as a basis for partitioning file items into groups. For 
example, items with a hierarchical information structure exhibit
superior-inferior relationships and can be grouped accordingly; records in an
associative structure might be grouped if they share common properties.
The file structure may or may not take advantage of the information
structure.

Another way of viewing file organization examines only its major
purpose: data access. With respect to retrieval programs (search, display,
maintenance), the information structure divides the file into groups of
items which are logically related and processed together. The method of
logical access determines how these related items are associated with
each other. For example, associations might be implemented by chains,
lists, naming conventions, physical adjacency, or functional
relationships among record identifiers. With respect to the operating system, a
file organization includes a method of physical access to facilitate
locating records on actual storage devices. For example, if absolute
addresses are used, records are located by direct access. However, if
names, reference numbers, or relative positions are used, then sequential,
indexed sequential, or partitioned access may be applicable (3).
Consequently, a file organization specifies two interfaces:

1) between retrieval programs and the supervisor's
I/O system (logical access) and

2) between the I/O system and storage devices
(physical access).

The first interface is equivalent to establishing a file directory 
for use in search negotiation. It is of primary interest in this research 
because of the rich supply of data structures applicable to such
directories and because of the challenging nature of access by subject content.
The second interface is given less attention because its options are
generally limited to those supported by a manufacturer's software.

B. File Organization in a Document Retrieval System

In all, a document retrieval system manipulates a variety of data 
including dictionaries, thesauri, search vectors, citations, and 
abstracts. Each of these may exist as a separate file or several may 
be combined in an integrated file with a single access method and multi- 
ple directories. In particular, dictionaries and thesauri are often 
combined for use by content analysis routines and search vectors and 
retrieval data may be integrated for search negotiation. In either case,
processing generally proceeds through the files one at a time starting
with a dictionary lookup of query terms and ending with the display of
retrieved document citations. However, because of interactive processing,
a file may be accessed several times in succession to display
information, process an altered query, or enter new data. Part or all of this
cycle may be repeated during iterated searches. In an on-line
environment, all processing must occur fast enough to keep users from
growing impatient; only file maintenance is considered an off-line
operation because of its non-critical nature.

For this study, a document file is defined as containing:

1) the searchable descriptions of texts (document
vectors),

2) initial retrieval data (reference numbers or
short citations), and

3) whatever directories are necessary.

Primary attention is given to the problem of logical file organization
with respect to subject searches, that is, logical access. Queries
involving only bibliographic information are assumed to be processed through
separate directories. Even limited to subject access, the organization
task is complicated by adverse file characteristics (enormous size,
rapid growth, and variable length records), the demands of sophisticated
programs operating on-line, and the elusive nature of relevancy. It may
be recalled that relevancy is a user defined property and is not
necessarily limited to records having the same descriptors as the query. As
a result, logical access includes defining connections among items with
loose semantic relations (imposing an information structure) as well as
providing a directory for accessing them.

The previous section sets forth the purpose of file organization in
terms of convenience and efficiency. With regard to on-line document
retrieval, a few design criteria must be emphasized or added. First,
the absolute prerequisite is real-time response. Although it may be
possible to process several requests simultaneously (batching), the
system must be prepared to treat users independently since it is unlikely
that any two requests in a batch pertain to the same portion of the
file. Second, user-machine interaction is definitely necessary to
clarify and satisfy information needs. This includes presentation of
data and processing sensitive to user responses; examples are tutorial
instruction, query formulation aids, browsing, query alteration processes,
and feedback (iterated) searching. Third, query-document matching is to
be based on a correlation function rather than satisfying a logical
predicate or simple coordinate matching. Either of the latter methods
could be used; however, correlation functions provide greater
flexibility in their treatment of both queries and documents. Specifically,
a change in the function or its parameters may allow more strict or
lenient scoring, different emphasis on vector properties, or the
addition of new factors to the entire procedure. Fourth, an appropriate
cost-performance tradeoff is desirable to satisfy users wanting small
amounts of information at low cost and those willing to pay for
comprehensive searches. Finally, evaluation standards must include not only
time and space, but user satisfaction. The standard measures of precision
and recall* are used to evaluate the quality of retrieved material and
thereby judge user satisfaction.

A complete set of design characteristics and evaluation measures for
file organizations undoubtedly includes additional aspects of
information systems. However, those outlined so far are the most important and
serve as a basis for comparisons in this work.

3. The Document Retrieval Environment

In order to be explicit about the type of retrieval system envisaged,
the following sections contain descriptions of the document file, user
factors, software services, and disk storage devices.

A. Document File Characteristics 

A document retrieval system operates on a collection of natural
language articles indexed in some fashion and stored as searchable
vectors. The document file itself is characterized by its large size,
rapid growth, and variable length records.

* Considering the documents retrieved by a search, precision is the
percent of retrieved which are relevant and recall is the percent of
relevant which are retrieved.

With respect to file size, it is noted that storage for the Library
of Congress catalogue would have required 10^12 bits in 1962. Further, a
number of public and university libraries already contain in excess of
a million volumes and the general trend is for holdings to double every
15 years (4). Even a modest system for handling journal articles in a
specialized field might expect several thousand additions per year (5).
These facts clearly indicate the need for mass storage devices and, for 
really large files, several classes of storage media — disk, data cell, 
photostore, tape. 
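
As a worked figure, the doubling trend quoted above can be converted into an equivalent annual growth rate. The calculation assumes smooth exponential growth, which the source does not state; the symbols N(t) for holdings after t years and N_0 for initial holdings are introduced here only for illustration.

```latex
N(t) = N_0 \cdot 2^{t/15}, \qquad
\frac{N(t+1)}{N(t)} = 2^{1/15} \approx 1.047
```

Under that assumption, doubling every 15 years corresponds to growth of roughly 4.7 percent per year.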

In addition, decreases in file size are rare because future editions 
of texts or re-publications of articles are generally viewed as new items 
rather than as replacements for old ones. In situations where current 
literature is of primary importance, the document file may be segmented 
into active and archive storage. However, this does not provide genuine 
reduction in many cases since all data remains retrievable and directory 
entries for older items are transferred in and out of memory during 
general processing. In most information centers the files only increase 
in size as new items are entered during periodic off-line maintenance 
runs. Usually, there is no significant demand for a real time update 
capability.

Texts obviously require variable length records because of their own 
variable lengths, the automatic indexing process, and the amount of 
retrieval data. Usually a variety of standard access points are included
in the vector (author, publication date, journal or call number, headings)
as well as the set of index terms for describing subject content. In
all, 200-300 characters are not unusual lengths for document records.

B. User Factors

Several user situations must be anticipated In the design of a 
successful on-line retrieval system. First, at least two classes of
users must be treated; new or infrequent users demand simple, straight- 
forward query submission, search, and retrieval functions while exper- 
ienced persons need flexible and powerful procedures. Second, the 
system must be able to cope with vague questions since in many cases 
a user in either class is unable to state his information needs accurately. 
Third, even if a precise query statement is given, it may not correspond 
to the stored text in any reasonable way. This is especially true if
important query terms have extremely high or low frequencies. In the 
former case, the search may yield too many items and in the latter case 
not enough items. Fourth, only some of the retrieved documents will 
be relevant due to inaccuracies in the indexing process, matching func- 
tions, etc. This condition necessitates some form of query alteration 
and iterated search if more relevant information is to be obtained. 

In an on-line environment, many of these problems are solved through 
user responses to information displays. For example, during query formu- 
lation, synonyms for the original query words might be presented along 
with frequency information in order to facilitate selection of proper
terms. Or the search process might be interrupted before its completion 
and early results examined to check that appropriate search paths are 
followed. Once the search is completed, citations or text for the
retrieved documents can be viewed. If the results prove unsatisfactory,
new searches could begin immediately. (See Section I.3.C)

To summarize, the success of a document retrieval system requires
on-line operation to overcome various user factors as well as incomplete
text analysis. For these reasons, this research investigates file
organizations which are suitable for on-line systems with respect to
cost and flexibility in the retrieval process.

C. Software Services 

The most satisfactory way to accommodate users is through on-line 
operation and options for selecting a wide variety of processing methods. 
A typical search negotiation in such a system is described by the
flowchart in Figure 1-1. The overall requirements for query formulation
aids, document display or browsing, and simple query alteration have 
been mentioned. The search, relevance feedback, and space modification 
functions require additional explanation. 

Subject searching involves several phases of activity. To start, 
assume that a query has undergone dictionary, thesaurus, or other 
analysis and that its content is represented by a vector of index terms 
and associated weights. The next step is to calculate correlations be- 
tween the query and the most promising documents. How this is accomplish- 
ed depends on the file organization. Recalling that most file systems 
partition records into logically related groups, the usual procedure is 
to examine all the items in several groups. For example, consider a file 
using chains to link documents having the same keyword. The chained 
records form partitions whose elements are located by following sequences
of pointers. To search the file, some or all of the chains named by
query terms are followed and a correlation is computed for each item 
encountered. Similar situations occur for serial, inverted, calculated, 
or clustered organizations. (See Chapter II.) In any case, once all
correlations are gathered and sorted by value, the names, citations, or
abstracts for the highest scoring documents are obtained. At this time, 
a set of secondary restrictions may be applied before a presentation is 
made to the user. Restrictions often include bibliographic or quantity 
limitations on output, verification of word order within the text, and 
others. Regardless of the secondary processing, the major portion of 
the search effort involves accessing file partitions and accumulating 
correlations. These are the significant aspects as far as file organiza- 
tion is concerned. 

Relevance feedback is a tool for obtaining improved search results 
through iterative searching (6, 7, 8). Briefly, the process works as
follows. A user at an on-line console views the citations or abstracts 
of a few documents retrieved by an initial search. Immediately or after 
consulting the source texts, he enters decisions as to which items are 
definitely relevant or non-relevant, possibly leaving a few items un- 
judged. The system alters his original request by adding or emphasizing 
descriptors found primarily in the relevant documents and by removing 
or reducing the importance of terms found primarily in the non-relevant. 

A new file search is made, the results presented to the user for his
inspection, and the entire process repeated if desired. Several varia-
tions of this general scheme are possible. Positive feedback employs
only terms from the relevant documents whereas negative feedback uses
only the non-relevant. These may be used separately, jointly, or
selectively depending on the output of the initial search. Positive
techniques are especially valuable when some relevant are retrieved and 
additional similar items are desired. Negative methods produce some 
success even if no relevant are retrieved in the initial search or if 
the relevant obtained are dissimilar. If a clustered file is used, docu- 
ments from separate clusters may generate distinct queries for operating 
within each cluster. Such query splitting procedures have been investi- 
gated by Borodin, Kerr, and Lewis (9). Although evaluation is difficult, 
at least one investigation of feedback searches reports improvements of
up to 10% in precision and recall, the standard performance measures for
information systems (7, 8). Thus feedback has been shown to be an
effective retrieval tool in an on-line environment and should definitely 
be included among the software services. 
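
The query alteration step can be illustrated with a minimal sketch. It assumes that queries and documents are held as dictionaries mapping index terms to weights. The function name `feedback_query` and the coefficients `alpha`, `beta`, and `gamma` are hypothetical and chosen only for illustration; the formulas actually evaluated in (6, 7, 8) may differ.

```python
from collections import defaultdict

def feedback_query(query, relevant, nonrelevant, alpha=1.0, beta=0.5, gamma=0.25):
    """Return an altered query vector (term -> weight).

    Terms from judged-relevant documents are added or emphasized (positive
    feedback); terms from judged non-relevant documents are reduced or
    removed (negative feedback). alpha, beta, gamma are illustrative values.
    """
    new_q = defaultdict(float)
    for term, w in query.items():
        new_q[term] += alpha * w
    for doc in relevant:
        for term, w in doc.items():
            new_q[term] += beta * w / len(relevant)
    for doc in nonrelevant:
        for term, w in doc.items():
            new_q[term] -= gamma * w / len(nonrelevant)
    # Terms whose weight falls to zero or below are dropped from the request.
    return {t: w for t, w in new_q.items() if w > 0}

query = {"cluster": 1.0, "file": 1.0}
relevant = [{"cluster": 1.0, "profile": 2.0}]
nonrelevant = [{"file": 4.0, "tape": 1.0}]
altered = feedback_query(query, relevant, nonrelevant)
```

Setting `gamma=0` gives a purely positive strategy and `beta=0` a purely negative one, so the same routine covers the separate and joint uses mentioned above.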

Document space modification tries to pass on the success of previous 
searches to future users by making the retrieved relevant correspond 
more closely to the original query (10, 11). Considering the set of
documents obtained after feedback and all other processing, each relevant 
item is modified according to whether its terms appear in the document, 
the query, or both. A similar, but inverse modification is applied to 
the retrieved non-relevant so that there are positive and negative 
strategies here also. Experiments by Brauen indicate substantial
improvement in future performance based on these methods and underscore
their importance to current retrieval systems.

Both techniques affect file organization in that they require access
to entire document vectors. Some file schemes distribute the index terms
of a vector throughout memory so that its entirety is unretrievable in
a single access. This may be acceptable for the search process, but in
order to use feedback or space modification, complete vectors must be
easily obtained.

D. Storage Devices 

The discussion of document file size points out that considerable
memory space is required for even a moderate size retrieval system. In
general, direct access devices have the most appeal in the solution of
the mass storage problem. Disks and drums provide sufficient capacity 
and speed to warrant their use for dictionaries, directories, and docu- 
ment vectors. Text and occasional items may be assigned to data cells, 
photostores, or magnetic tape. 

Throughout this research, the IBM 2314 Direct Access Storage Facility 
is used as a model device. Its hardware consists of eight interchangeable
disk packs (volumes) each with a capacity of 29 million characters
(12). The average access time is approximately 100 milliseconds for a
3600 character data block, provided there is no queue for the single I/O
channel servicing all packs. This also excludes whatever delays are
encountered while unblocking records. Specifically, the access time
consists of waits for the completion of four events:

1) access motion: positioning the access arm (set
of read-write heads) at the proper cylinder,

2) head selection: switching to select the read
head for the appropriate track,

3) rotational delay: rotation necessary to position
the read head at the start of the data, and

4) data transfer: reading the data into main memory.

As a result of the mechanical motions involved, disk fetches require
several milliseconds and are slow and expensive relative to internal
computing speeds. Cylinder changes are most costly as a substantial amount
of time is required to accelerate the access arm. For this reason it
is better to input a single large data record than several small ones
scattered throughout the volume. Other delays are much smaller, the
head selection time being negligible. A full revolution of the pack
requires 25 ms regardless of whether data is read. Consequently, the expected
rotational delay is 12.5 ms and the transfer time is proportional to the
amount of information moved. Table 1-1 summarizes the storage and
access specifications for the IBM 2314 (12).
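
The arithmetic behind the preference for one large record can be sketched as follows. Only the 25 ms revolution time is taken from the text; the 60 ms arm motion time and the 7200-character track capacity are assumed round figures for illustration and are not the values of Table 1-1. The sketch also assumes that one full track is transferred per revolution.

```python
REVOLUTION_MS = 25.0  # full revolution of the pack, as stated in the text

def expected_fetch_ms(seek_ms, block_chars, track_chars, head_select_ms=0.0):
    """Expected time for one disk fetch as the sum of the four waits."""
    rotational_delay_ms = REVOLUTION_MS / 2  # half a revolution on average
    # Transfer time is proportional to the amount of data moved
    # (assumption: one full track passes under the head per revolution).
    transfer_ms = REVOLUTION_MS * block_chars / track_chars
    return seek_ms + head_select_ms + rotational_delay_ms + transfer_ms

# Assumed figures: 60 ms arm motion, 7200 characters per track.
one_large = expected_fetch_ms(seek_ms=60.0, block_chars=3600, track_chars=7200)
four_small = 4 * expected_fetch_ms(seek_ms=60.0, block_chars=900, track_chars=7200)
print(one_large, four_small)
```

Under these assumed figures the same 3600 characters cost several times more when scattered as four small records, because the arm motion and rotational delay are paid on every fetch.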

There are many ways of reducing the delays from disk fetches. 
Accesses can be confined to the same cylinder if possible or programs 
might be implemented so that processing on the current record is finished 
in time to fetch the next record while in the same rotation. However, 
the best way to ensure short, real-time response is to minimize the
number of fetches. This holds true even in time-shared or multiprogrammed
systems, for a small number of I/O interrupts means that program
execution is suspended less often. As a result, the primary measure of 
response time and the quantity for minimization is the number of disk 
accesses.

4. The Clustered File Organization

A. General Concepts 

A clustered file is arranged so that similar documents are located 
near each other within the storage medium. Generally an automatic 
classification procedure is used to compare document descriptions with 
each other and define the so-called clusters. Retrieval programs either 
process all or none of the items in a cluster and therefore transfer 
large blocks of data from mass storage rather than many smaller individual
records. For each cluster, a profile characterizes its information
content and acts as a directory element. Loosely speaking, a profile 
is an aggregate of the index terms in the clustered documents and has a 
structure resembling a document or query vector. The search process 
matches a query with each profile and examines in detail only those 
clusters with the highest scores. For large collections, there may be 
a great many clusters (profiles) and the classification procedure is 
often re-applied to group profiles. The result is a hierarchical 
directory similar to that in Figure 1-2. 
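
A single-level version of this search can be sketched as below. The sketch assumes that a profile is simply the sum of the document vectors in its cluster and that the cosine correlation is the matching function; both are illustrative choices, and the names `build_profile` and `cluster_search` are hypothetical. Profile definition is one of the questions this thesis investigates, so this is not the definition adopted later.

```python
import math

def cosine(u, v):
    """Cosine correlation between two sparse vectors (term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def build_profile(docs):
    """Aggregate the index terms of the clustered documents (assumed: a sum)."""
    profile = {}
    for vec in docs.values():
        for term, w in vec.items():
            profile[term] = profile.get(term, 0.0) + w
    return profile

def cluster_search(query, clusters, n_clusters=1, n_docs=2):
    """clusters: list of (profile, documents), documents being name -> vector."""
    # Match the query with each profile and keep the highest scoring clusters.
    ranked = sorted(clusters, key=lambda c: cosine(query, c[0]), reverse=True)
    scores = []
    for _profile, docs in ranked[:n_clusters]:
        # All items in a selected cluster are examined; others are never read.
        for name, vec in docs.items():
            scores.append((cosine(query, vec), name))
    scores.sort(reverse=True)
    return [name for _score, name in scores[:n_docs]]

c1 = {"d1": {"disk": 1.0, "file": 1.0}, "d2": {"disk": 1.0}}
c2 = {"d3": {"query": 1.0, "feedback": 1.0}}
clusters = [(build_profile(c), c) for c in (c1, c2)]
print(cluster_search({"disk": 1.0}, clusters))
```

A hierarchical directory would apply the same profile matching at each level before descending to the document clusters.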

Referring to Lefkovitz's ideas (2), the classification imposes a
hierarchical information structure on the data and the tree storage and 
search procedures are parts of the file organization. A hierarchical 
structure is natural to a document file with respect to usage since users 
are able to find analogies between the automated search and their personal
library activities. In an interactive environment, a user might
browse within the file by viewing information in the profiles. Subse- 
quent operations could allow query alteration before or even during the 
search. Finally, if at least one relevant document is located, additional
pertinent articles should be found in the same file area.

B. Areas of Investigation 

This thesis treats three aspects of a clustered file organization:
profile definition and storage, updating, and secondary uses of the
hierarchy. Because profiles are aggregates of document vectors, their
definition involves many factors including limitations on the number of
index terms, weighting procedures, and number of allowable sons. All of
these affect the system's ability to discriminate among clusters and 
hence the accuracy of performance. The actual order of tree storage 
strongly influences response times and search cost. File update techniques
are important in order to maintain speed and accuracy. New items
must be re-distributed occasionally to achieve physical proximity of 
related information and thereby maintain reasonable data access times. 

In addition, profiles might be modified during update in order to reflect 
the presence of new cluster members. A few additions make little difference,
but it is unreasonable to expect a hierarchy to perform well after
the file has increased its size several times. Finally, the high cost 
of current classification methods leads to the idea of using the profiles 
and document hierarchy for several purposes. By spreading the clustering 
expense among all applications, this overhead is more easily justified. 
Several possibilities are discussed, including the concept of associating 
a term thesaurus with each node in the hierarchy. Thus, statistically
related thesaurus terms are available for modifying requests as
the search enters various parts of the collection. 

These are the problems for consideration and all of them are related
by their association with the cluster profiles: either definition,
storage, changes, or secondary uses. Actual classification methods
are not examined here, but surveyed in Section III.2. Search techniques
are described only to the extent that they are used in the research.

5. Outline 

The present chapter outlines the purpose and operation of a document 
retrieval system, the file organization problems within it, and the 
general processing environment in which these problems are to be solved. 
The next two chapters concentrate on file organization methods, first 
surveying current schemes, then describing the clustered file in con- 
siderable detail. Chapter IV is devoted to the experimental system and 
evaluation methods used in this research. Chapters V to VIII are given 
to the areas for investigation— profile definition, updating, tree 
storage, and query alteration based on profile information. The final 
chapters include a comparison of the clustered and inverted organizations 
and the conclusions and future investigations suggested by this work.

Chapter II

Survey of File Organizations

1. Introduction

Most file organizations partition their records into groups related
by some natural or imposed criteria. The logical access method specifies
how related items are associated while the physical access method
facilitates translating record names into storage addresses. Here, file
structure and logical access are emphasized rather than the details of
data management services (physical access). This chapter surveys existing
file organizations applicable to a document collection and evaluates
their utility in an on-line retrieval system. In these systems, a search
consists of at least four steps:

1) directory scan— choosing the file partitions for 
detailed examination; 

2) accumulation of query-document correlations within 
the selected partitions; 

3) selection of items for output— sorting correlations, 
fetching retrieval data, and applying secondary re- 
strictions; and 

4) information display. 

Particular attention is given to the directory scan and the generation 
of correlations since these are the most crucial and costly steps. 

In all, five types of organizations are discussed: sequential,
chained, inverted, calculated access, and clustered. Some variations 
and combinations are considered also. Much of the terminology is
adapted from previous surveys by Lefkovitz (1), Salton (2), Lowe and
Roberts (3), and Meadow (4). As far as evaluation is concerned, most
surveys examine retrieval time and storage usage. In addition to these,
the present discussion considers the evaluation quantities outlined in
Section I.2.B: convenience of implementing desired software features,
maintenance, quality of retrieved material, cost-performance tradeoff,
and general appropriateness to on-line document retrieval.

2. Methods of Logical Organization 

A. Sequential Files

A sequential file organization stores documents in the order of 
their acquisition and retrieves them by a complete scan of all records.
In fact, the file shows little organization at all since it has no parti- 
tions, no directories, and no particular order for item storage. 

Because the complete file is scanned during a search, the retrieval time 
is prohibitive for on-line operation. However, sequential files are 
still justifiable in current awareness systems or in other situations 
where it is possible to accumulate requests over short periods of time. 
In these cases, a search is made when the query batch is large enough to 
yield an acceptable cost per user. A number of retrieval systems use
the sequential organization and employ this batching technique (5, 6, 7).
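
The batching idea can be sketched in a few lines: the file is read once, and every record is compared with every accumulated query. The sketch assumes documents and queries are sets of index terms and uses a simple term-overlap count as the match function; the names `batch_scan` and `overlap` are hypothetical.

```python
def overlap(query_terms, doc_terms):
    """Illustrative match function: number of index terms in common."""
    return len(set(query_terms) & set(doc_terms))

def batch_scan(documents, queries, match=overlap):
    """One complete pass over the file serves every query in the batch.

    documents: list of term sets, stored in order of acquisition.
    queries:   dict mapping a query identifier to its term set.
    """
    results = {qid: [] for qid in queries}
    for accession, doc in enumerate(documents):      # the single file scan
        for qid, q in queries.items():
            score = match(q, doc)
            if score > 0:
                results[qid].append((score, accession))
    for hits in results.values():
        hits.sort(key=lambda h: (-h[0], h[1]))       # best scores first
    return results

documents = [{"disk", "file"}, {"query"}, {"file", "cluster"}]
queries = {"q1": {"file"}, "q2": {"query", "cluster"}}
print(batch_scan(documents, queries))
```

The cost of the scan is fixed by the file size, so the cost per user falls as the batch grows.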

In spite of their large access time, sequential files have several 
favorable properties. First, their storage requirements are minimal, 
since no pointers, linkages, or other overhead is involved. Second, 
information can be retrieved via almost any criteria since entire document
vectors may be examined. For example, bibliographic information;
the presence, absence, or weights of index terms; and almost any correlation
function can be used for query-document matching. Third, maintenance
is trivial since new records are simply stored at the end of the
current file. 

Finally, the expense of structuring the file in any other way may 
force the use of the sequential scheme. This is obviously the case if 
the volume of activity is so low that the cost of organizing the file 
exceeds the total cost of the searches made. Coffman and Bruno recognize
this situation and suggest splitting the file into structured and sequential
parts (8). The search process begins in the structured portion and
proceeds to the sequential portion only if necessary. Items retrieved from 
the sequential subfile are subsequently transferred to the structured sec- 
tion. A slightly more complicated scheme would record the number of times 
a document is retrieved and retire information from the structured subfile 
when its retrieval rate falls below a chosen threshold. Exactly where 
the file should be split is unclear. Lipetz considers the tradeoff 
point between sequential and some other organizations; his results may
have bearing on this problem (9).

A similar scheme proposed by Leimkuhler partitions the file into
bins so that the probability of a successful search is maximized for the
effort expended (10). Each bin forms a sequential subfile of the next
group of documents judged equally likely to satisfy queries. The search
is cumulative, always starting with the same initial bin and proceeding
to the last bin unless the user is satisfied earlier. The points of
file division and expected search effort are computed from a form of
Bradford's law relating literature productivity, number of references
per document, and collection completeness. Leimkuhler favors a 2-bin
system having 20% of the most pertinent documents in the first bin. It
is expected that 2/3 of the requests are satisfied by searching only the
initial subfile.

In a more restricted situation, Ghosh (11) considers searching a
sequential file to find documents containing all the descriptors found
in the query (not necessarily implying true relevance). A set of
queries is defined to have the consecutive retrieval property (CRP) if
there is a sequence for storing all records so that the documents
satisfying each query are located consecutively. A file ordered in this
way has not only minimum storage requirements, but also minimum search
time if a suitable directory is used. Ghosh shows that an arbitrary
query set can always be divided into overlapping groups of queries with
CRP. The complete file, then, is the concatenation of sequential subfiles
associated with each query group. Although this organization
appears interesting and perhaps applicable to finding actual relevant 
documents, no implementation or evaluation information is known. 
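
The definition of the consecutive retrieval property can be made concrete with a small check. The sketch below only verifies whether a given storage order satisfies the property for a given query set; it does not construct such an order and is not Ghosh's procedure. The function name `has_crp` is hypothetical.

```python
def has_crp(order, queries):
    """Check the consecutive retrieval property for one storage order.

    order:   list of record names in their storage sequence.
    queries: dict mapping each query to the set of records satisfying it.
    """
    position = {record: i for i, record in enumerate(order)}
    for answer in queries.values():
        if not answer:
            continue
        slots = [position[r] for r in answer]
        # The records are consecutive exactly when they fill a gap-free span.
        if max(slots) - min(slots) + 1 != len(slots):
            return False
    return True

queries = {"q1": {"a", "b"}, "q2": {"b", "c", "d"}}
print(has_crp(["a", "b", "c", "d"], queries))   # every answer set is contiguous
print(has_crp(["a", "c", "b", "d"], queries))   # q1 is split by record "c"
```

With such an order, a directory needs only the first position and the length of each answer set, which is the source of the minimum search time mentioned above.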

These extensions of sequential files do not change the fact that 
their access times are too slow to permit on-line search, display,
feedback, etc. The results indicate, however, that sequential organizations
cannot be discarded too quickly, especially when used in conjunction 
with other techniques. The schemes of Ghosh and Leimkuhler actually 
border on single-level document clustering. The fact is that serial
order is acceptable for parts of a document collection, but not its 
entirety. The problem is to decide how the file should be partitioned 
and how each portion should be accessed.

B. Chained Files

Logically, the chained file organization partitions documents into 
sets of vectors having common index terms. The elements of each set are 
chained together and accessed by following a sequence of pointers. The
file directory consists of a table giving the head of each chain while 
actual document vectors are stored in order of acquisition. A new docu- 
ment is attached to the file by placing it at the head of the chains 
corresponding to its index terms. As a result all chains are in decreas- 
ing order by accession number and point to the most recent information 
first. The file search procedure using this type of organization is 
depicted in Figure II-1. A binary search, hashing method, or other
lookup technique is used to scan the directory and locate the beginning
of the document chains corresponding to each query term. Subsequently
all items along these chains are fetched and correlated with the query 
vector to establish which documents are suitable for output. Once the 
retrieval cutoff is chosen, accession numbers or other retrieval data 
are displayed. In some situations a separate file is accessed in order 
to present complete citations or abstracts for the documents at the top 
of the ranked list. 

With a chained organization, a document vector is a set of triples
(t,w,p) where t is an index term, w is its associated weight, and p is 
a pointer to the next document containing t. Only the weight is an 
optional component; the term identifier is necessary to detect matches 
and to differentiate among the chains which intersect in a document. 
Because a pointer is required for each assigned index term, the total
space needed for them is considerable. Depending on the number of bits
allocated to each element of the triple, the file size may increase as
much as 100% (12). The space required for the file directory is small
in most cases and presents no genuine implementation problem. 
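
The structure just described can be sketched as a small in-memory model. The class name `ChainedFile` and its methods are hypothetical, accession numbers stand in for disk addresses, and no attempt is made to model blocking or storage layout.

```python
class ChainedFile:
    """Toy model of a chained file: a directory of chain heads plus vectors."""

    def __init__(self):
        self.directory = {}   # index term -> accession number at head of chain
        self.vectors = []     # document vectors in order of acquisition

    def add(self, weighted_terms):
        """Attach a new document at the head of each of its term chains."""
        accession = len(self.vectors)
        vector = {}
        for term, weight in weighted_terms.items():
            # Triple (t, w, p): p points to the previous head of t's chain.
            vector[term] = (weight, self.directory.get(term))
            self.directory[term] = accession
        self.vectors.append(vector)
        return accession

    def chain(self, term):
        """Yield accession numbers along a chain, most recent document first."""
        p = self.directory.get(term)
        while p is not None:
            yield p
            _weight, p = self.vectors[p][term]

f = ChainedFile()
f.add({"disk": 1, "file": 2})
f.add({"file": 1})
f.add({"disk": 3})
print(list(f.chain("disk")), list(f.chain("file")))
```

Because each new item becomes the head of its chains, every chain is automatically in decreasing order by accession number, and one pointer is stored per assigned index term.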

In order to alleviate some of the overhead for pointers, the multilist
file combines several descriptors into a "superkey" and maintains a
single chain for items containing all descriptors (13, 14). To effect a
savings, the vocabulary must contain term pairs or triples which occur a
significant number of times throughout the file. Unfortunately there are
very few super-keys which meet these conditions. Tests involving triplets
show that 90% of the super-keys occur only once. As a result, the multilist
structure is only slightly better than chaining single keys and does
not provide the storage economy that is needed.

Information systems with a variety of interrelated record types
often use chains to reduce the storage of redundant information as well
as to provide the desired access points. The Integrated Data Store (15) 
circularly chains common data elements and addresses them by pointers 
from other parts of the file. Similarly, records having the same attri- 
bute values are chained in rings which may also point to other rings. 
Logically, the file is a highly inter-connected network of data elements 
and records accessed through a single directory. Physically, items are 
packed into pages and fetched from disk through the data management 
facilities of COBOL. Several other programming systems, CLP (16), APL
(17), and DMS (45) provide facilities for defining and manipulating 
similar structures. Although ideal for many files, ring structures provide
few benefits for information retrieval since the typical document
file has only one record format, little repetitious data, and few complex
linkages. 

Considering data retrieval in chained organizations, the search
time is proportional to the total length of the chains involved plus a
small amount for the directory scan. Naturally chains intersect at
various places, and duplicate effort may result. For example, if one
chain is followed to its end before another is considered, then some
documents are visited twice and unnecessary accesses are made. This 
situation is eliminated by considering the next element in all chains 
before fetching a new document vector. Since each chain is ordered by 
accession number, always selecting the highest number insures that no 
items are re-scanned. In addition, all correlations are gathered in a 
single sweep across the disk, i.e. without jumping back across data 
previously passed over. If only recent data are required, it may be 
possible to terminate the search early. Specifically, the accession 
numbers used as pointers might contain an indication of publication date 
or dates could be examined directly in the vector. Regardless of these 
factors, search time is still approximately one access per document 
correlation, although this may be reduced somewhat further by using 
blocked records. Although better than sequential files, it is doubtful 
that a chained organization provides access that is fast enough for
on-line operation.
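
The single-sweep traversal can be sketched with a merge of the chains. The sketch assumes each chain is available as a sequence of accession numbers in decreasing order and uses the standard library routine `heapq.merge` to pick the highest remaining number at each step; the function name `single_sweep` is hypothetical.

```python
import heapq

def single_sweep(chains):
    """Visit every document on the given chains exactly once, highest first.

    chains: iterables of accession numbers, each in decreasing order.
    """
    visited = []
    # Always take the highest accession number among the next chain elements.
    for accession in heapq.merge(*chains, reverse=True):
        if not visited or visited[-1] != accession:
            visited.append(accession)    # one fetch and one correlation
        # Otherwise the chains intersect here and the document is not re-read.
    return visited

print(single_sweep([[9, 7, 4, 1], [8, 7, 2], [9, 2]]))
```

The visiting order is monotone in accession number, which corresponds to one pass across the disk, and stopping the loop early yields only the most recent items.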

Lefkovitz (1) describes two variations of the chained organization
which are designed to reduce search time: the controlled list length and
cellular organizations. In the first case, each document chain is
limited to a pre-determined maximum size. When the maximum is exceeded,
another directory entry is made and the list is continued. In the
second case, a new directory entry is made each time the chain crosses 
a cell boundary. A cell is defined as a convenient sub-unit of the 
storage device; for disks, the cylinder is an appropriate choice. In 
either case, the directory size and scan time increase. In systems
testing for complete matches of query terms, it is possible to intersect 
sections of chains and thereby eliminate some documents not having all 
query terms. However, matching based on correlation functions allows 
documents to be retrieved even if they do not completely match the query. 
As a result, all chained items must be examined and the intersection 
process is of little value. 

Chained file organizations have some advantages in an on-line 
system. Pointers make it possible to relate documents in a wide variety 
of ways; e.g. similarity of keyword assignments or bibliographic data; 
statistical correlation; or a priori knowledge of related publications, 
former editions, or other factors. Moreover, a number of pre-search 
statistics are available to aid query formulation. For example, chain 
length indicates the number of items indexed by each term (specificity 
of the query) and the total list length is certainly an upper bound on
the number of retrievable documents. Such information might form a 
useful display. Real-time updating is also possible since only a few 
pointers must be changed to incorporate new items into the file. Finally, 
since document vectors are stored intact, both relevance feedback and 
space modification procedures are possible. 

Nevertheless, the disadvantages of chains outweigh their advantages.
First, precautions must be taken to avoid and repair breaks due to
hardware or software failure. Second, in spite of easy file addition
procedures, changing an existing document is difficult, especially if 
its vector must be enlarged. Finally, the main objections are the heavy 
storage overhead for pointers and search times proportional to the number
of document correlations. The last disadvantage is a particular hindrance
since better retrieval is generally obtained by examining an
increasing number of documents.

C. Inverted Files

The inverted file organization has wide use in on-line retrieval
systems because it provides quick access to data (18, 19, 20, 21).
Logically the file is partitioned into sets of documents having a common
keyword, although records themselves are stored in any order. The
directory contains one entry per vocabulary term; each entry lists the
accession numbers of documents containing that term. In a sense, docu- 
ments are chained, but all pointers reside in the directory rather than 
in vectors. Figure II-2 depicts the structure and use of a file organized
in this way, that is, using lists of accession numbers inverted by
index term. A search begins by scanning the directory to obtain the 
lists associated with each query term. After merging the lists, each 
vector is fetched from disk, matched with the query, and the resulting 
correlation stored in a ranking table. Finally, the highest scoring 
documents are retrieved and displayed. 
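
The basic search just outlined can be sketched as follows. The sketch assumes an in-memory directory and uses a simple inner product as the correlation; the names `build_directory`, `inner_product`, and `inverted_search` are hypothetical, and list indices stand in for accession numbers.

```python
from collections import defaultdict

def build_directory(vectors):
    """One directory entry per term, listing accession numbers containing it."""
    directory = defaultdict(list)
    for accession, vector in enumerate(vectors):
        for term in vector:
            directory[term].append(accession)
    return directory

def inner_product(query, vector):
    return sum(w * vector.get(t, 0) for t, w in query.items())

def inverted_search(query, directory, vectors, match=inner_product, top=2):
    # Directory scan: obtain and merge the lists for each query term.
    candidates = set()
    for term in query:
        candidates.update(directory.get(term, []))
    # Each candidate vector is fetched and matched with the query.
    ranking = [(match(query, vectors[a]), a) for a in sorted(candidates)]
    ranking.sort(key=lambda r: (-r[0], r[1]))   # the ranking table
    return ranking[:top]

vectors = [{"disk": 1, "file": 1}, {"query": 2}, {"file": 3}]
directory = build_directory(vectors)
print(inverted_search({"file": 1, "query": 1}, directory, vectors))
```

Note that this version still performs one vector fetch per candidate document, which is the cost the following paragraphs seek to remove.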

[Figure II-2. Structure and Search of a Document File Using Lists of
Accession Numbers Inverted by Index Term. Legend: a = document accession
number, c = correlation, t = index term, w = weight, R = retrieval data,
D = document vector.]

As outlined, the inverted file produces no better performance than
a chained organization. The retrieval time is still proportional to the
number of correlations since complete vectors are fetched from random
disk locations. In fact, the situation is worse since the directory
is quite large and cumbersome. However, retrieval time can be substantially
reduced by eliminating many items during the directory scan. For
systems using Boolean queries, this is clearly the case since conjunctions
of search keys imply list intersections while disjunctions imply
unions. As a result the only documents actually fetched are the ones
which completely match the request.
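
For Boolean requests the directory scan reduces to set operations on the accession lists, as in the following minimal sketch. The function names `boolean_and` and `boolean_or` are hypothetical, and the directory is assumed to be a dictionary from terms to lists of accession numbers.

```python
def boolean_and(directory, terms):
    """Conjunction of search keys: intersect the accession lists."""
    lists = [set(directory.get(t, ())) for t in terms]
    return sorted(set.intersection(*lists)) if lists else []

def boolean_or(directory, terms):
    """Disjunction of search keys: take the union of the accession lists."""
    return sorted(set().union(*(directory.get(t, ()) for t in terms)))

directory = {"disk": [0, 2, 5], "file": [2, 3, 5], "tape": [4]}
print(boolean_and(directory, ["disk", "file"]))
print(boolean_or(directory, ["disk", "tape"]))
```

Only the accession numbers surviving these operations need to be fetched from the main file.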

A similar procedure is possible with some correlation coefficients
for unweighted vectors. The ranking table is used as an area for accumulating
the number of matching terms between the query and various
documents. For many match functions, these totals constitute the numerators
of the coefficient; normalization is all that is necessary before
the final value is obtained. There are several ways to obtain the
normalizing denominators for coefficients:

1) access their document vectors, 

2) provide a special table, or 

3) include them with the accession numbers in the 
directory lists. 

None is very pleasing. The first case returns to the situation of 
accessing all items on the merged list in order to obtain a complete set 
of correlations. The other methods are faster, but increase the directory 
size. Regardless of cost, it is possible to obtain complete correlations 
after just the directory scan, and the resulting savings are considerable. 
Retrieval time is reduced to an amount proportional to the number of 
query teiws plus the time needed to fetch data for final display— 
certainly fast enough for on-line work. Moreover, since correlations 






are computed from directory information, index terms are no longer
needed as part of the stored document vectors. In fact, if only
accession numbers are output, the entire file might consist of just the
directory, resulting in storage requirements approximately the same
as for a serial file. This is certainly a pleasing situation for initial
searches, but it precludes the use of relevance feedback and document
space modification.
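
The directory-only search for unweighted vectors can be sketched as
follows. This is a minimal illustration, not the SMART implementation:
it assumes binary vectors, a cosine-style normalization, and the second
option listed above (a special table of document term counts). The
names build_directory, search, and term_counts are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def build_directory(docs):
    """docs maps accession number -> set of index terms (binary vector)."""
    directory = defaultdict(list)   # index term -> list of accession numbers
    term_counts = {}                # special table: accession -> number of terms
    for accession, terms in docs.items():
        term_counts[accession] = len(terms)
        for term in terms:
            directory[term].append(accession)
    return directory, term_counts


def search(query_terms, directory, term_counts):
    """Accumulate match counts in a ranking table, then normalize."""
    query_terms = set(query_terms)
    ranking_table = defaultdict(int)
    for term in query_terms:                      # one list fetch per query term
        for accession in directory.get(term, []):
            ranking_table[accession] += 1         # numerator: matching terms
    scores = {
        accession: matches / sqrt(len(query_terms) * term_counts[accession])
        for accession, matches in ranking_table.items()
    }
    # Highest correlation first; ties broken by accession number.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Only documents sharing at least one term with the query ever enter the
ranking table, and no document vector is read during the scan.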

To review, both relevance feedback and space modification produce 
increased user satisfaction and are, therefore, desirable tools in a 
retrieval environment. However, they require the entire contents of
document vectors: in one case to obtain index terms and in the other case
to modify them. The inverted directory contains all this data but in
term order rather than document order. Consequently the only realistic
way to implement these features and maintain on-line operation is to 
have both the inverted and main files present. This combined file 
approach nearly doubles the storage requirement. And in the case of 
space modification changes must be made to both files. 

The inverted organization can be modified to work with weighted
query and document vectors also. On one hand, the system could operate
as outlined in Figure II-2. On the other hand, the directory can be
augmented to permit a more rapid search. For example, let
$D = (d_1, d_2, \ldots, d_n)$ and $Q = (q_1, q_2, \ldots, q_n)$ be a
document and query vector respectively. Here $d_i$ and $q_i$ denote the
weights assigned to the $i$-th term. The cosine correlation between
these vectors is:

\[ \mathrm{COS}(Q,D) = \frac{\sum_{i=1}^{n} q_i d_i}{|Q|\,|D|} \tag{II-1} \]



In order to compute COS(Q,D) during the directory scan, the values of
$d_i/|D|$ are stored in the appropriate directory lists along with
accession numbers. This is possible since all the needed information is
available when the document is added to the file. During a search, the
total correlation is accumulated in the ranking table by summing the
contributions from matching terms. The contribution from matching terms
with weights $q_i$ and $d_i$ is:

\[ \mathrm{CONTRIBUTION} = \frac{q_i d_i}{|Q|\,|D|} = \left(\frac{q_i}{|Q|}\right)\left(\frac{d_i}{|D|}\right) \tag{II-2} \]

The left-hand factor in the last equality is obtained from the query and
the right-hand factor from the directory. It is important to note that
normalization is already included in the directory entries. An example
of correlation calculation is shown in Figure II-3. With this procedure
correlations are obtained after the directory scan and only the
retrieval data for the highest scoring documents must be fetched. Further,
both terms and weights may be removed from the original vectors if
feedback and space modification procedures are not used. Assuming the
system design includes these features, a combined file must be
maintained; again the storage overhead is approximately 100%.
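
The accumulation of cosine correlations from directory entries can be
sketched as below. This is an illustrative sketch under stated
assumptions: vectors are held as term-to-weight dictionaries, every
vector has at least one non-zero weight, and the function names are
hypothetical rather than taken from any existing system.

```python
from collections import defaultdict
from math import sqrt


def build_weighted_directory(docs):
    """docs maps accession number -> {index term: weight}.

    Each list element stores (accession, d_i / |D|), so normalization is
    done once, when the document is added to the file.
    """
    directory = defaultdict(list)
    for accession, vector in docs.items():
        norm = sqrt(sum(w * w for w in vector.values()))  # assumed non-zero
        for term, weight in vector.items():
            directory[term].append((accession, weight / norm))
    return directory


def cosine_search(query, directory, top_k=10):
    """Sum (q_i / |Q|) * (d_i / |D|) over matching terms only."""
    query_norm = sqrt(sum(w * w for w in query.values()))
    ranking_table = defaultdict(float)
    for term, q_weight in query.items():
        left = q_weight / query_norm                 # factor from the query
        for accession, right in directory.get(term, []):
            ranking_table[accession] += left * right  # factor from the directory
    ranked = sorted(ranking_table.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]
```

After the scan, only the retrieval data for the documents returned by
cosine_search would need to be fetched.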

The advantage of computing a correlation during the directory scan
is the rapidity of the search. Although this technique works with the
cosine correlation, it is inconvenient or impossible with others. Each
matching function has its own peculiarities, but in general two
requirements must be met. First, non-matching document terms must not
have a direct influence on the coefficient since only the lists
corresponding to matching terms are examined. However, in some cases,
such information may be derived from other quantities already on hand.
Second, the computation must permit the accumulation of a total
coefficient. In the case of the cosine function, this is accomplished by
storing normalized contribution values in the lists of accession
numbers. Of course, extra storage allocated to each list element could
facilitate almost any computation, but practical considerations
generally impose some limitations. As examples, consider the
product-moment, Tanimoto, and overlap correlations:



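Product-moment:

\[ \frac{(Q-\bar{Q})\cdot(D-\bar{D})}{|Q-\bar{Q}|\,|D-\bar{D}|} \tag{II-3} \]

Tanimoto:

\[ \frac{Q\cdot D}{n\bar{Q} + n\bar{D} - Q\cdot D} \tag{II-4} \]

Overlap:

\[ \frac{\sum_{i=1}^{n} \min(q_i, d_i)}{\min(n\bar{Q},\, n\bar{D})} \tag{II-5} \]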



where $\bar{Q} = \frac{1}{n}\sum_{i=1}^{n} q_i$ and
$\bar{D} = \frac{1}{n}\sum_{i=1}^{n} d_i$. The product-moment measure
cannot be computed in the manner suggested because of negative
contributions from non-matching terms. The Tanimoto and overlap
coefficients do not allow for easy calculation since values of $q_i$,
$d_i$, and $\sum d_i$ must be available. In order to compute either
correlation, weights for document terms must be included in the
accession lists and the values of $\sum d_i$ must be obtained from a
separate table.
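
A sketch of the extra machinery this implies is given below. It assumes
the Tanimoto form Q.D / (sum of q_i + sum of d_i - Q.D), with weights
between 0 and 1 so that the denominator stays positive; that form, the
function names, and the doc_sums table are assumptions made for
illustration.

```python
from collections import defaultdict


def build_directory_with_sums(docs):
    """docs maps accession number -> {index term: weight in [0, 1]}."""
    directory = defaultdict(list)   # term -> [(accession, raw weight d_i)]
    doc_sums = {}                   # separate table: accession -> sum of d_i
    for accession, vector in docs.items():
        doc_sums[accession] = sum(vector.values())
        for term, weight in vector.items():
            directory[term].append((accession, weight))
    return directory, doc_sums


def tanimoto_search(query, directory, doc_sums):
    """Accumulate Q.D from the lists, then finish with the side table."""
    dot = defaultdict(float)
    for term, q_weight in query.items():
        for accession, d_weight in directory.get(term, []):
            dot[accession] += q_weight * d_weight
    query_sum = sum(query.values())
    return {
        accession: product / (query_sum + doc_sums[accession] - product)
        for accession, product in dot.items()
    }
```

Unlike the cosine case, the list elements cannot be pre-normalized, so
each candidate costs one additional table lookup.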







Henceforth, it is assumed that system design permits calculation of
correlations within the directory so that the problem of quick response
reduces to the question of rapid access. Unfortunately accession lists
have variable and often considerable lengths and many standard directory
scanning techniques do not work. Length is especially troublesome if
automatic document indexing is used and nearly every major word is
treated as an index term. Without other controls, it is not uncommon
for a list to span several disk tracks. Collmeyer and Shemer (22)
consider forming serial, tree-structured, and hash-coded indexes to the
accession lists and conclude that hashing is preferred. With a hashing
procedure, the storage location of an accession list is computed from
the bit pattern for the corresponding index term. Higgins and Smith
(23, 24) give considerable attention to hash-coded indexes, looking at
methods for computing addresses and handling overflows. Equal attention
is given to the storage of accession lists to permit easy access and
maintenance. Exponential chaining is developed as a technique for
handling file additions. Generally, in the updating process accession lists
must be lengthened, possibly by re-writing them or by chaining the 
overflow to the original list. In time, the directory deteriorates 
into chains of updated entries, which slows processing. The exponential 
chaining procedure adds an entire block of available space to the over- 
flow chain each time the previous block is filled. By continually 
increasing the size of the additional block, lists are kept from being 
too fragmented in storage. A variation of this scheme uses periodic
maintenance runs to collect all segments of a chain and to re-write it
as a single block along with an exponentially increasing number of
available overflow spaces.
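
The growth pattern of exponential chaining can be illustrated with a
small in-memory model. This is only a sketch of the idea described
above, not the procedure of Higgins and Smith: the class name, the
doubling factor, and the way spare space is sized during maintenance are
all assumptions.

```python
class AccessionList:
    """One accession list stored as a chain of blocks of growing size."""

    def __init__(self, first_block_size=4, growth=2):
        self.growth = growth
        self.spare = first_block_size        # spare slots granted on compaction
        self.blocks = [[]]                   # chain of blocks, oldest first
        self.capacities = [first_block_size]

    def add(self, accession_number):
        # When the last block is full, chain a new block that is
        # `growth` times larger than the previous one.
        if len(self.blocks[-1]) == self.capacities[-1]:
            self.capacities.append(self.capacities[-1] * self.growth)
            self.blocks.append([])
        self.blocks[-1].append(accession_number)

    def scan(self):
        """Read the list in order; one block corresponds to one access."""
        for block in self.blocks:
            yield from block

    def compact(self):
        """Maintenance run: rewrite the chain as a single block with an
        exponentially increasing amount of spare overflow space."""
        items = list(self.scan())
        self.spare *= self.growth
        self.blocks = [items]
        self.capacities = [len(items) + self.spare]
```

With doubling, a list of n entries occupies on the order of log n
blocks, which is what keeps the chain from fragmenting badly.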

To summarize, the inverted file search examines only pertinent
records and yields an acceptable search time at the price of increased
storage space. The time results from three factors:

1) index searching to locate the appropriate accession lists, 

2) fetching accession lists and computing correlations, 
and 

3) obtaining retrieval data for the documents to be 
displayed. 

The first two factors are proportional to the number of query terms
while the third is related to the amount of desired output. It is
unfortunate that search time increases with query length since this
penalizes longer queries, which generally perform better. In addition,
since the relevance feedback process adds terms to queries, the second
and third iterated searches are considerably more expensive. Storage
space and complexity are the major drawbacks of this organization. In
order to get all the desired features, combined files must be used and
substantial penalties incurred in terms of space and maintenance. The
problems of long accession lists, overflows, and chained blocks within
lists are considerable. Generally, the same retrieval features available
with the chained organization are also applicable to the inverted file.

D. Computed-access Files

Computed-access files are those that try to approach large scale
content addressable memory. In other words they manipulate the contents
of a query vector to calculate the storage address of a group (bucket)
of documents which are highly similar to the query. Hopefully, many of
these documents are actually relevant. Hash addressing (scatter storage)
is a computed-access method that has been used successfully with a single
search key (25, 26, 27, 28, 29). To extend this idea to several keys
(index terms), a conglomerate address must be calculated which is
descriptive of the entire query or document and the query addresses must
be mapped into the document addresses. The first scheme discussed below
follows this idea to some extent, while the second scheme uses concepts
from finite geometries to compute addresses. Regardless of the method,
calculated access differs from previous organizations in that:

1) the file is partitioned into groups of items related
in some mathematical manner and

2) a major portion of the access procedure is based
on computation rather than on a directory scan.

Files and Huskey describe a retrieval system using super-imposed
coding which partially meets the above criteria (30). Document vectors
are maintained in serial order as described in Section II.2A. A
directory entry is made for each document consisting of a code word
(N bits) and a pointer to its vector. This directory is kept in serial
order also. Now consider a specific document. The character string for
each of its index terms is hashed to a value between 1 and N, and the
corresponding bit in the document code word is turned "on". All other
bits remain "off". Once the document code word is generated in this
manner it is placed in the directory. To search the file, a query code
word is generated by the same hash function and is matched with all
document codes. Each time the query code is a subset of a document code,
the pointer to the document is saved. Later the pointers are used to
access complete vectors and correlations are computed (See Figure II-4).
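
The code-word construction and subset test can be sketched in a few
lines. This is an illustrative sketch only: the use of zlib.crc32 as the
hash function, bit positions numbered from 0, and the function names are
assumptions, not details of the Files and Huskey system.

```python
import zlib


def code_word(terms, n_bits):
    """Superimpose one bit per index term into an n_bits-wide code word."""
    word = 0
    for term in terms:
        bit = zlib.crc32(term.encode("utf-8")) % n_bits  # hash term to a bit
        word |= 1 << bit                                 # turn that bit "on"
    return word


def build_directory(docs, n_bits):
    """Serial directory of (code word, pointer); the pointer here is
    simply the accession number."""
    return [(code_word(terms, n_bits), accession)
            for accession, terms in docs.items()]


def candidates(query_terms, directory, n_bits):
    """Save the pointer whenever the query code is a subset of a document code."""
    query_code = code_word(query_terms, n_bits)
    return [accession for code, accession in directory
            if code & query_code == query_code]
```

Every document that contains all the query terms passes the test, but a
document may also pass by coincidence of bit positions (a false drop),
which is why full vectors are still correlated afterwards.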

The scheme appears neat and simple, but depends on making the
number of bits per code word large enough to eliminate most false
drops, and at the same time small enough to make the directory manageable.
The serial directory scan is slow, perhaps, but does allow for easy
update. For on-line work, more speed could be obtained by sorting the
documents by code word. This facilitates bulk transfers of data rather
than individual records. Further, if the directory itself is ordered
according to the first K bits of its code words, a secondary table might 
act as an index to it. Given a query code word, the first K bits become 
a subscript to the secondary index which leads to a substantial section 
of the directory and then to blocks of documents. As supplemented, this 
approach might prove feasible with respect to storage space and access 
time. There is a problem with accuracy, however. Examining a document 
only when its code word is a complete superset of the query code word 
may prove too strict. On one hand, documents containing only a few query 
terms are not examined under this principle, whereas they might actually 
have high correlations with the query and even be retrieved with a 
different method. However, examining a document when there is one or 
more matching code word bits would be such a lenient criterion that most 
of the file would be scanned. Solutions to these problems approach the 
clustering techniques and profile generation methods reserved for later 
discussion. In its original form, the super-imposed coding system is
probably not feasible for on-line document retrieval. However, proper
enhancements could make it more useful.

For some time, Ghosh and others (31, 32, 33, 34, 35) have considered
the following problem. Assume that document vectors are stored serially
and consider a directory table holding only the document accession
numbers. The problem is to organize the index into buckets so that given a
query, the proper bucket addresses are generated by solving algebraic
equations. In the context of the work, the "proper bucket" is a bucket
containing the accession numbers of documents having all the query terms.
The problem is solvable for some special cases: for binary vectors, for
queries with a small number of terms, and for queries with two
multiple-valued attributes. In general, the organization represents query
attributes (terms) by dependent or independent linear forms (hence the
bounded query size). The system of equations to be solved in connection
with retrieval is $Hx = v$ where $H$ is a matrix of coefficients, $x$ is
the vector of query attributes, and $v$ is the vector of attribute values.
The elements of $H$ are elements of a Galois field of finite dimensions,
sometimes being powers of its primitive elements. The solution
$x = H^{-1} v$ is used to generate bucket addresses and the proper
documents are obtained.

As fine as the concept sounds, the organization is more a
mathematical object than a working reality. The reasons are twofold. First,
there are difficulties with the size and computations for the matrix $H$
as well as reducing $Hx = v$ to echelon form, obtaining a solution, etc.
Second, the redundant storage factor for these schemes is very high
because of the nature of the task, namely computing bucket labels for
any combination of terms. Only a few example calculations are available;
but for 3 attributes of 9 values each (729 record types), the scheme uses
81 buckets and, on the average, stores a record 2.8 times. For 2
attributes of 9 values and 3 attributes of 3 values, the redundancy factor
is 7.0. As a result, this approach falls short of the desired situation
of economically handling thousands of attributes of many values apiece.

Unfortunately, no really good calculated-access methods are known
at this time.

E. Clustered Files

A clustered file partitions documents into subject classes and uses
a hierarchy of profiles to describe and access each cluster. The term
clustering refers to the use of a classification scheme to produce
groups (clusters) of statistically related items. Early research on
classification methods was conducted by Needham (36), Doyle (37),
Parker-Rhodes (38), and others. The combination of document clustering,
profile hierarchies, and on-line searching developed more recently,
mainly due to the work of Salton's SMART project (39, 40, 41, 42).
Because a great deal of descriptive material is presented in Chapter III,
the discussion here is confined to relating clustered files to other
organizations.

Logically, the file is partitioned into subfiles (clusters)
using the similarity criteria of the classification procedure. Generally,
many of the grouped documents share several index terms rather than just
one or two. Hopefully, this implies some semantic relationship among
the texts also. However for retrieval purposes, all that is required is
for documents in a cluster to have greater internal similarities than
external similarities. If profiles are made which reflect this bias, then
the best classes for searching are those most similar to the query. As
mentioned earlier, a profile is a composite of subordinate vectors, so
its format makes it possible to measure query-cluster similarities by
profile correlations. The search process begins by computing
query-profile correlations for the nodes on the highest level. After
ranking, nodes having correlations above a chosen cutoff are expanded.
That is, the sons of the nodes are obtained and the
correlation-expansion procedure is applied to them. The search, then, is
an alternating series of correlations and expansions, terminating with
document correlations and actual retrieval.
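
The alternating correlate-and-expand search can be sketched as follows.
This is a minimal illustration under assumptions: vectors are
term-to-weight dictionaries, the cosine correlation is used at every
level, a single fixed cutoff applies throughout, and the Node class and
function names are hypothetical.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import Optional


def cosine(a, b):
    """Cosine correlation of two sparse vectors (term -> weight)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Node:
    vector: dict                         # profile, or document vector for a leaf
    sons: list = field(default_factory=list)
    accession: Optional[int] = None      # set only for documents


def cluster_search(top_level, query, cutoff):
    """Correlate one level, expand nodes above the cutoff, repeat."""
    retrieved = []
    frontier = list(top_level)
    while frontier:
        scored = [(cosine(query, node.vector), node) for node in frontier]
        frontier = []
        for score, node in scored:
            if node.accession is not None:
                retrieved.append((node.accession, score))  # document correlation
            elif score > cutoff:
                frontier.extend(node.sons)                 # expansion
    return sorted(retrieved, key=lambda pair: (-pair[1], pair[0]))
```

Raising the cutoff narrows the search and lowers its cost; lowering it
broadens the search toward a full scan of the file.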

Throughout the organization, items are stored serially. This
includes profiles on the first tree level, the sons of each node, and the
clusters of document vectors. Since large blocks of data (sons of a
node) are transferred from disk rather than individual items, the number
of accesses made in a search is proportional to the number of expanded
nodes. For a very narrow strategy (expanding only the best node at each
step) the number of accesses is approximately equal to the number of
levels. For broader strategies, the cost increases accordingly. This
situation produces a welcome cost-performance tradeoff. A narrow search
yields only a few relevant documents, but at low cost; broader searches
are more comprehensive and more expensive. Of course, a full file search
is always possible by scanning all documents directly. In all cases, the
marginal cost of retrieving a few extra documents is small.

The storage overhead for this organization is basically the space
for profile vectors. It is difficult to state the total requirement
accurately because it depends on the properties of the hierarchy:
number of levels, degree of nodes within those levels, and the amount
of overlap. (See Chapters III and V.) This research shows that
acceptable performance levels are maintained even when the profiles are
reduced to only 7% of the space allocated to the document vectors. The
result is substantiated under a variety of conditions and using
hierarchies with widely different properties.

The expense of classification is the largest disadvantage of the
clustered file. The number of operations to classify N items may be
proportional to $N^2$, to $N \log N$, or to $N$ depending on the method. There is
some evidence that more expensive methods produce better clusters, but 
cheaper schemes are not completely unacceptable either. This study dis- 
regards the classification cost and suggests that hierarchies can be 
used for several purposes. By distributing expenses among several appli- 
cations, clustered files are more easily justified. Alternate uses in- 
clude automated browsing for viewing how information is structured, check- 
point displays during searching, construction of a retrieval thesaurus, 
query alteration procedures, and others. (See Section 8 of Chapter III.) 

The maintenance of a clustered file is a drawback also. For a time
new documents can be successfully blended into the existing hierarchy
with or without changes to profiles. Eventually the quality of the
hierarchy diminishes because the profiles try to represent too much
information. This may imply a complete re-clustering. However, a more
reasonable scheme is to record the number of additions to each path of
the tree and to re-cluster selectively. That is, to re-structure only
a single subtree of data and to re-connect it to the rest of the
hierarchy.

There are a number of other advantages of a clustered file in
addition to those of time and space. Since complete document vectors are
available, relevance feedback is easily implemented. Document space
modification may require changes to both documents and profiles, however.
In addition, almost any correlation coefficient can be computed because
complete information is at hand. Several modes of operation might be
provided by simply changing the matching function. Cost-performance
trade-offs and browsing features have been mentioned. The general
appropriateness of the hierarchical structure is not to be overlooked
either. Its concept is familiar to most users through their previous
library activities, and it provides a reasonable structure for browsing
and observing the relationships among the stored texts. In summary, the
clustered organization not only facilitates storage and access, but also
participates in other phases of the retrieval process.

3. Methods of Physical Access

Physical access deals with the process of locating records on a
storage device regardless of their relationship with other file items.
For example, suppose that the search process follows a chain or expands
a cluster and consequently requires correlations with three documents,
$D_{17}$ among them. These records must be fetched from disk, deblocked
from their track format, and transferred to work areas for the
correlation program. The physical access method specifies how the
operating system actually finds these vectors on disk. Methods are
generally limited to those supported by a manufacturer's software since
the task is one of interfacing the operating system and a storage
device. Three schemes in wide use are: sequential, direct, and indexed
sequential (43).

Sequential access is most applicable to magnetic tape, although
used with disk also. It is simple in concept and places little burden
on the operating system or application programs. Records are stored in
order of increasing reference numbers and located by their relative
position in the file. Given the current position, a desired record is
found by computing the number of intervening items and reading or
skipping over this intermediate information. Although slow for accessing
random disk locations, sequential schemes provide the most rapid way of
obtaining multiple or quantity input from a localized area of the file.

Direct access is the fastest way to obtain records from random disk
locations. The operating system is given the disk address of desired
items and actually does little more than initiate I/O activity and
transfer input data to program buffers. The application program has the
burden of supplying the required disk addresses, generally using lookup
or hashing methods. In the first case, the file construction and update
processes return addresses of stored items. These are used directly as
pointers, elements of accession lists, etc. Whenever an item is needed,
its disk address is obtained by table lookup or scanning a list. In the
second case, a record accession number is hashed into a disk address and
the record is stored at this address or in a connected overflow area.
The advantage of hashing is that it avoids a lengthy lookup process; its
disadvantages include problems of overflow and selection of a hash
function.
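
The hashing variant of direct access, including its overflow problem,
can be modelled in memory as below. This is a toy sketch: a real disk
address is replaced by a list index, the hash function is a simple
remainder, and the class and method names are hypothetical.

```python
class HashedFile:
    """Records addressed by hashing the accession number."""

    def __init__(self, n_addresses):
        self.home = [None] * n_addresses   # one record slot per "disk address"
        self.overflow = {}                 # address -> chained overflow records

    def _address(self, accession):
        return accession % len(self.home)  # assumed hash function

    def store(self, accession, record):
        address = self._address(accession)
        slot = self.home[address]
        if slot is None or slot[0] == accession:
            self.home[address] = (accession, record)
            return
        # Home address already taken by another record: use the overflow area.
        chain = self.overflow.setdefault(address, [])
        for i, (key, _) in enumerate(chain):
            if key == accession:
                chain[i] = (accession, record)
                return
        chain.append((accession, record))

    def fetch(self, accession):
        address = self._address(accession)
        slot = self.home[address]
        if slot is not None and slot[0] == accession:
            return slot[1]                      # found with a single access
        for key, record in self.overflow.get(address, []):
            if key == accession:
                return record                   # found after following the chain
        return None
```

No lookup table is consulted, but every collision lengthens an overflow
chain, which is the cost noted above.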

Indexed sequential access is easier to use and slower than the direct
method, but more restrictive and faster than the sequential scheme.
Briefly, the operating system constructs a set of hierarchical directories
as the file is built (43, 44, 46). Application programs access items
simply by specifying their reference numbers (keys). To start, records
are stored in order of increasing key value in the prime data area of
each disk cylinder. The rest of the cylinder consists of a track index
and an overflow area for new items. During this process, the key for
the last record associated with each track is stored in the track index.
Moreover, the entire file is preceded by a cylinder index containing the
key of the last record associated with each cylinder. Several levels of
master index may be used to point to sections of the cylinder index. The
result is a hierarchical directory structure which is distributed about
the disk (See Figure II-5).

To retrieve data, the operating system (or data management system)
compares the key of the desired record with the entries in the highest
level index. Descent to lower level indexes is made depending on
whether the key has a high, equal, or low match. Prime data tracks are
searched sequentially to find the proper record. Actually, retrieval
is more complex because of overflow conditions resulting from file
updates. Suppose a new record is to be inserted in the middle of the file,
but will not fit onto the desired track. In this situation, the entire
track is re-written in proper sequence and items pushed off its end are
moved to an overflow area. Overflow items are linked together and to
their home track by making a second entry in the track index.
Consequently when the track index is examined during retrieval, either
the prime area is searched sequentially or the overflow area is searched
by following a chain. Overflow tracks might be allocated for each
cylinder, for the entire file (single area at the end), or both. In the
last case, the secondary area contains items overflowing from the
cylinder areas.

[Figure II-5: Indexed Sequential Access. Diagram labels: Master Index;
Cylinder Index. OFEs = overflow entries in the track index. Index
entries are the largest document numbers per cylinder, track, overflow
chain, etc.]
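
The descent through the indexes can be sketched as below. This is a
simplified model, not an operating system's access method: it assumes a
single cylinder index with no master index, omits overflow chains
entirely, and uses hypothetical names and in-memory lists in place of
disk tracks.

```python
from bisect import bisect_left


def find_record(key, cylinder_index, track_indexes, tracks):
    """Locate a record by key.

    cylinder_index   : sorted list, highest key stored in each cylinder
    track_indexes[c] : sorted list, highest key on each track of cylinder c
    tracks[c][t]     : list of (key, record) pairs in increasing key order
    """
    # First index entry whose key is >= the desired key selects the cylinder.
    c = bisect_left(cylinder_index, key)
    if c == len(cylinder_index):
        return None                      # key is beyond the end of the file
    # Same comparison against the track index of that cylinder.
    t = bisect_left(track_indexes[c], key)
    if t == len(track_indexes[c]):
        return None
    # The prime data track itself is searched sequentially.
    for stored_key, record in tracks[c][t]:
        if stored_key == key:
            return record
        if stored_key > key:
            break
    return None
```

If the indexes are held in core, as assumed later in this chapter, a
lookup of this kind costs a single track read.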

Indexed sequential access is selected in this research as the method
most appropriate to the clustered file. It is particularly useful since
the sequence of cluster access is unpredictable and its indexes provide 
the required retrieval speed. However once a cluster is selected, its 
volume of data is retrieved as a single entity and sequential, rapid 
transfer is most important. Because the most efficient operation occurs 
when entire tracks of information are transmitted from disk, it is assumed 
that records are not deblocked by the operating system. Master and 
cylinder indexes are also assumed to reside in core storage as well as 
the track index for the cylinder currently being accessed. These same 
conditions are used when the inverted and clustered organizations are
compared on the basis of search time. Because direct access may be more
appropriate for inverted files, the results are presented in this way
also (See Chapter IX).

4. Summary

This chapter surveys current file organizations applicable to direct
access devices. In accordance with the purpose of this work, logical
structure (partitions of records, linkages among related items) has
been emphasized rather than data management functions (physical structure).
In all, five types of logical organization are described and evaluated
with respect to usage in on-line document retrieval. Table II-1
summarizes the evaluation of sequential, chained, inverted, calculated- 
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access methods to make a complete comparison. 

Some comments must be added to the summary. First, all
organizations are designed so that relevance feedback and document space
modification could be implemented. For sequential files, this involves
expending a great amount of time while it means maintaining a combined 
file for inverted organizations. For clustered schemes, profiles might 
have to be changed during space modification. The point is that both 
features can be made available, but with varying costs. Second, in the 
area of pre-search statistics, the chained and inverted schemes supply 
the number of documents containing specified index terms, and other 
information which might be of interest to an on-line user. Clustered 
files cannot provide this same data, but offer the possibility of inter- 
rupting a search and viewing checkpoint information. Third, regarding 
quality of output, the chained and inverted files supply the same pre- 
cision and recall values as a full search (serial scan). Exceptions may 
arise, however, in the treatment of documents with tied correlations. 

The quality of a cluster search varies with the expended effort so that 
a cost-performance tradeoff is possible. However, most relevant docu- 
ments are retrieved early in the search. It might be possible to achieve 
a similar tradeoff with an inverted organization by examining the acces- 
sion lists of only a few query terms. However, the system is really 
processing a different query under this condition. Finally, the major
difference between the inverted and clustered schemes is the tremendous
storage overhead of the former and the classification expense of the
latter. The maintenance difficulties for both are non-trivial. In the
end, the additional flexibility and general appropriateness of the
clustered file may be the deciding factors in the choices made by
actual systems designers.
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Chapter III 



Clustered Files 

1. Introduction

The purpose of this chapter is to present a comprehensive discussion
of clustered files. Individual sections are devoted to classification
methods, hierarchy formation, search techniques, updating, cluster storage,
request clustering, and alternate uses for the hierarchy. The intent is
to provide an understanding of the construction, use, and maintenance of
clustered files and to introduce concepts used in the experiments described
in later chapters.

2. Classification Methods 

A clustered file organization depends on a classification algorithm
to group documents with similar properties. Since document properties
(keywords) reflect subject content, clusters actually consist of
semantically related items even though the partition is made on a
statistical basis. Classification methods can be characterized in at
least two ways: by direction of application and by amount of work.
Generative methods arrive at the final hierarchy by "bottom-up"
processing. Initially all documents are considered as individuals and
highly similar items are placed in classes (clusters). A profile is
constructed to represent each class and higher level clusters are
produced by grouping profiles. The process continues until only a small
number of items remain. Divisive methods construct the hierarchy in
"top-down" fashion. Initially all documents are placed in a single class
which is divided into a few large subclasses. Thereafter each subclass
is treated independently and is split into additional subclasses. This
division process continues until the size of a subclass falls within
specified limits. Regardless of the method, clustering entails
considerable expense, and algorithms can be characterized by the number
of operations required to classify the N items on a single level. A
great many algorithms employ similarity matrices giving pairwise
associations among items; in general $N^2$ operations are required to
generate such a matrix. Other algorithms partition or group items into
crude subclasses and then refine the division. Often, the elements of
each subclass are compared with profiles for other classes and then
shifted about. If there are k subclasses, then the total clustering
effort is proportional to kN operations. Some schemes do not
re-distribute items after assigning them to initial subclasses and thus
handle them only once per level. These are so-called one-pass
algorithms. Apart from top-down and bottom-up application and the work
involved, classification algorithms are based on a wide variety of
techniques including eigenvalue analysis, factor analysis, latent class
analysis, clump theory, and others (1). For most information retrieval
applications, a scheme must provide control over cluster size and
overlap, and at the same time it must not demand excessive computation
or storage space.

Sokal and Sneath (2) describe a number of clustering algorithms
using similarity matrices. For the present, let $S_{ij}$ denote the similarity
between documents i and j. In the single linkage method, the similarity
between classes A and B is defined as

$$S_{AB} = \max_{i \in A,\ j \in B} S_{ij}$$

The hierarchy is made by choosing a threshold $\theta$ and joining pairs of
documents or classes which have similarity coefficients greater than $\theta$.
The threshold is reduced and pairing resumes; the entire process continues
until a termination condition is satisfied. The average weight method is
similar, but uses a different measure:

$$S_{AB} = \frac{1}{n_A n_B} \sum_{i \in A} \sum_{j \in B} S_{ij}$$

Here $n_A$ and $n_B$ are the number of items in their respective classes. The
complete linkage scheme joins two classes if and only if $S_{ij} > \theta$ for all
$i \in A$, $j \in B$. Each of these algorithms generates the hierarchy from the
bottom up and requires $N^2$ calculations to make the similarity matrix plus the
work involved in the pairing process. Unfortunately, the problems of
matrix storage and calculation generally prohibit the use of $N^2$ methods
in document classification.
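
The three class-similarity rules described above can be sketched as follows. This is a minimal illustration, assuming the pairwise similarities are already held in memory as a nested list; the function names and the sample matrix are hypothetical and are not taken from Sokal and Sneath.

```python
from itertools import product

def single_linkage(sim, A, B):
    """Similarity of classes A and B: the largest pairwise similarity."""
    return max(sim[i][j] for i, j in product(A, B))

def average_weight(sim, A, B):
    """Similarity of classes A and B: the mean of all pairwise similarities."""
    return sum(sim[i][j] for i, j in product(A, B)) / (len(A) * len(B))

def complete_linkage_joins(sim, A, B, theta):
    """True only if every cross-class pair exceeds the threshold theta."""
    return all(sim[i][j] > theta for i, j in product(A, B))

# Hypothetical symmetric similarity matrix for four documents.
sim = [
    [1.0, 0.8, 0.3, 0.1],
    [0.8, 1.0, 0.4, 0.2],
    [0.3, 0.4, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
]
A, B = [0, 1], [2, 3]
print(single_linkage(sim, A, B))
print(average_weight(sim, A, B))
print(complete_linkage_joins(sim, A, B, theta=0.15))
```

Each call touches every cross-class pair, which is the source of the $N^2$ cost noted in the text when the classes together cover the whole collection.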

One-pass classification algorithms examine each item only once per
level. Again, the procedures rely on similarity coefficients and a
threshold $\theta$. As records (documents or profiles from the previous level)
are read from the input device, the first item starts a cluster. The
second item is compared with the first; if $S_{12} \geq \theta$, the second item is
assigned to the same cluster; if $S_{12} < \theta$, the second item starts a new
cluster. Subsequent items are compared with all previous classes and
either join the best existing class or start a new one. Nagy and Casey
(3) require each item added to a cluster to have a similarity greater
than $\theta$ with all previous members. Hill (4) and Johnson and La Fuente
(5) avoid examining items in a cluster by using a profile to represent
its members. Similarities are measured using the profile, whose makeup is
altered to reflect the addition of new items or the loss of old ones.
The cutoff $\theta$ can remain constant or vary to provide better control over
the size and number of clusters (5). Obviously, one-pass methods are
applicable in top-down or bottom-up fashion. The work involved is difficult
to establish because the number of clusters changes as the algorithm
proceeds. However, if K clusters are eventually formed from N items, the
total work for that level is bounded by KN operations. One-pass methods
are definitely economical of time and space, but are questionable in
terms of quality. Although promising results are known (5), additional
research is needed in this area.
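
A one-pass procedure of the profile-based kind can be sketched as below. This is an illustrative sketch only: it assumes a cosine similarity, a profile formed by summing member vectors, and a fixed threshold; none of these choices is attributed to the cited authors, and the names `one_pass_cluster` and `docs` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine correlation of two equal-length vectors (0.0 if either is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def one_pass_cluster(docs, theta):
    """Assign each document once: join the best cluster or start a new one."""
    profiles, members = [], []
    for idx, doc in enumerate(docs):
        scores = [cosine(doc, p) for p in profiles]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= theta:
            members[best].append(idx)
            # Assumed profile rule: the sum of the member vectors.
            profiles[best] = [p + d for p, d in zip(profiles[best], doc)]
        else:
            members.append([idx])
            profiles.append(list(doc))
    return members, profiles

# Hypothetical binary document vectors over four index terms.
docs = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 0]]
members, profiles = one_pass_cluster(docs, theta=0.5)
print(members)   # [[0, 1, 4], [2, 3]]
print(profiles)  # [[3, 2, 0, 0], [0, 0, 2, 1]]
```

Each document is compared with at most K profiles, which gives the KN bound on work per level mentioned above.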

Between the $N^2$ and one-pass methods is a class of top-down algorithms
which re-adjust an initial partition of the documents in order to
improve the cluster quality. Rubin and Friedman (6) start with a random
partition and use hill-climbing techniques, "forcing passes", and
"re-assignment passes" to minimize the data scatter within clusters. Litofsky
(7) partitions the vocabulary of a node (all index terms in the
corresponding documents) before associating documents with these sub-vocabularies.
Similar methods by Doyle and Dattola (8, 9) are used exclusively in this
study. To start, a number of unrelated documents are chosen as cluster
seeds and their vectors are used as profiles. Next follows the scoring
cycle in which each document is compared with all profiles. If, for a given
document, the highest document-profile score is above a given threshold
$\theta$, the document is placed in the corresponding cluster at the end of the
cycle. If the highest score is less than $\theta$, the document is assigned to a
special class of loose items. After all documents are scored, the cycle
ends; profiles are re-defined to reflect the gain or loss of cluster
members; and parameters are adjusted to control size and overlap.
When a cycle results in no changes to the classification, an iteration
is said to have ended. At this point, if the number of loose documents is
large, the cutoff $\theta$ is lowered and a new iteration begins. If only a
small number of items remain loose, processing terminates for this level.
Remaining loose documents may be blended into existing clusters or passed
on to the next hierarchy level as individual items. Although the algorithm
appears complex, it has a number of desirable features, including control
of cluster size, overlap, and disposition of loose items. The process is
applied in top-down direction and works in time proportional to kN
operations per level, where k is the number of clusters on that level.
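
The scoring cycle and the iteration that repeats it can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the Doyle or Dattola procedure itself: it uses a cosine score, summed-vector profiles, and no overlap, and it omits the size and overlap controls and the lowering of the cutoff between iterations. All names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine correlation of two equal-length vectors (0.0 if either is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def scoring_cycle(docs, profiles, theta):
    """Score every document against every profile; return (clusters, loose)."""
    clusters, loose = [[] for _ in profiles], []
    for idx, doc in enumerate(docs):
        scores = [cosine(doc, p) for p in profiles]
        best = max(range(len(profiles)), key=scores.__getitem__)
        if scores[best] > theta:
            clusters[best].append(idx)
        else:
            loose.append(idx)
    return clusters, loose

def redefine_profiles(docs, clusters, old_profiles):
    """Rebuild each profile as the sum of its members (kept as-is if empty)."""
    return [
        [sum(col) for col in zip(*(docs[i] for i in members))] if members else list(old)
        for members, old in zip(clusters, old_profiles)
    ]

def iterate(docs, seeds, theta, max_cycles=20):
    """Repeat scoring cycles until the classification stops changing."""
    profiles = [list(docs[s]) for s in seeds]
    previous = None
    for _ in range(max_cycles):
        clusters, loose = scoring_cycle(docs, profiles, theta)
        if clusters == previous:
            break
        previous = clusters
        profiles = redefine_profiles(docs, clusters, profiles)
    return clusters, loose, profiles

docs = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0], [0, 1, 0, 1]]
clusters, loose, profiles = iterate(docs, seeds=[0, 2], theta=0.6)
print(clusters, loose)  # [[0, 1], [2, 3]] [4]
```

Every cycle compares each of the N documents with each of the k profiles, which matches the kN cost per level stated in the text.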

At this time, optimal classification methods have not been found
for use in document retrieval. All methods are moderately successful
depending on the data and amount of computation, but still not enough is
known about methods for automatically identifying and grouping related
pieces of text. The difficulties may lie in the document indexing or in
the classification methods themselves. Classification is not the topic
for investigation in this research; and although its problems are not
mentioned further, the limitations of current knowledge do affect the
results presented here.

3. Hierarchy Formation 

A cluster hierarchy is a connected arrangement of profiles which act 
as a file directory. In the previous section, cluster generation is 
described as a level by level process having one set of profiles asso- 
ciated with each level. The hierarchy is constructed by connecting each 
profile to its constituent elements on the next lower level. The result is a 
tree structure whose roots are top level profiles, whose leaves are documents, 
and whose intermediate nodes represent the middle level profiles. The tree
is used for both retrieval and updating functions, and in both cases
each profile's purpose is to characterize all documents beneath it. For
retrieval, the documents most similar to a query are found by comparing
the query with the first level profiles and following the most promising
paths. For updating, a new document is treated as a query and its vector
is stored with the most similar lowest level cluster. In addition to the
broad questions concerned with the hierarchy structure, two topics of
specific concern are methods of linkage among nodes and methods of defining
profiles.

A. General Structure 

A systems designer needs to satisfy the desires of several types of 
users and to maintain a balance among response time, search cost, and 
retrieval performance (precision-recall). To some degree, these measures
depend on the general shape of the hierarchy: cluster size, amount of
overlap, and number of levels, all of which are interdependent themselves.

Large clusters of documents or profiles produce a hierarchy with only a 
few nodes and levels compared to a hierarchy based on small clusters. As 
a result a search consists of fewer profile comparisons and is quicker 
and cheaper. The output, however, is inferior since presumably it is 
more difficult to detect the presence of a relevant document by examining 
the profile of a large cluster than to detect it by examining the profile 
of a small cluster. 

Not a great deal is known about hierarchical shape, and most
classification algorithms avoid the problem by providing parameters for cluster
size, overlap, number of levels, etc. An experimental approach to finding
optimal cluster size is to classify a sample collection several times,
to search it using the same search strategy and keeping the amount of work
constant, and to determine the best cluster size based on the
precision-recall plots. Even if this difficult test sequence is maintained, the
results may not be applicable to different classifications, search
parameters, or collections. The problem of controlling the experiment centers
on keeping the search effort constant while changing the cluster size. A
more basic difficulty is that of choosing an adequate and appropriate
measure of search effort. The above test scheme is difficult, but
superior to methods which examine only search time or storage space. At this
point, no optimal cluster size is known. The values used in this study
are based on the findings of current research and their appropriateness to
practical situations.

In general, clusters overlap so that a document may have multiple
memberships, and may be retrieved from several search paths. The optimal
amount of overlap has been investigated in a manner similar to that
outlined above. That is, cluster size is held constant while overlap is
varied and the evaluation is based on precision-recall curves from searches
using fixed strategies. The problems related to work measurement are
simplified because cluster size is constant. Results by Dattola (9)
suggest that a small amount of overlap is best (2%-10%). This is an
important finding because of its influence on search times and the method
of linkage among nodes of the hierarchy.

B. Linkage Among Nodes

Linkage between levels of profiles or profiles and documents can be
accomplished in at least two ways. Pointers might link a parent to its
sons either by starting a chain in the parent node and placing one pointer
in each son or by collecting all pointers in the parent. With indexed
sequential access, record keys are used as pointers rather than disk
addresses so that the file remains relocatable. By assigning keys in
increments greater than one, new documents or profiles are blended into
the file by giving them keys which fill the vacancies in the sequence.
Actual vectors are stored in data overflow areas as described in Chapter
IX. Deleting a record from the file is accomplished by removing its
pointers from the hierarchy. Unfortunately, in some cases pointers limit
the extent of advanced buffering since the current record must be obtained
before the next access is initiated. Another problem occurs if updating
overflows the intervals between record keys. However, because pointer
linkage is easy to manipulate and is applicable regardless of the amount
of overlap, it is often used with a great deal of success.

Implicit linkage methods assign each profile and document a structured
key which is a sequence of digits identifying the path locating its node
in the hierarchy (Figure III-1a). Given a node for expansion, its key and
degree are used to derive the keys of its sons. Buffering may be advanced
as far as possible since the keys of all desired records are generated at
once. One of the great advantages of a structured key is that it contains
a complete description of the record's place in the file. From a single
key, parent, son, or filial nodes can be accessed for information that
would be useful in broadening or narrowing searches, in browsing processes,
or in relevance feedback. In addition, knowledge of the local structure
allows for file integrity checks and possible reconstruction following
hardware or software failures. Structured keys meet the requirements of
increasing values for indexed sequential access and even permit several
storage orders. Right-justifying keys and sorting them into ascending
sequence places nodes in order by levels; left-justifying and sorting
results in subtree order; and their combination yields heir-filial order
(Figure III-1b). During updating, new pieces can be added to any part
of the hierarchy without disturbing previous linkages. As before, the
new vector is located in a data or overflow area and only the degree of
the parent node is altered. Item removal is best accomplished by setting
a deletion indicator and leaving the rest of the record intact. Space
is reclaimed during maintenance processing when items are shuffled about
to achieve more efficient operation. At this time a record may be
assigned a new key, but this poses no serious problems since the keys are
only used for internal identification.

[Figure III-1. b) Orders of Cluster Storage]

Under the assumption of a low percentage of overlap, either method of
linkage works economically. Structured keys require overlap to be handled
by duplicating document vectors and storing them with each cluster in
which they have membership. An advantage of duplication is that each
cluster becomes a contiguous block of data and therefore accessible at
maximum speed. Pointer linkage can be used to save the space given to
duplicate items, but cluster access time increases due to the excess
jumping about the disk to fetch the overlapping items. The space
requirement for a structured key and degree is slightly larger than that needed
for pointers and record keys, but probably not enough to make a real
difference. Assuming a small amount of overlap, the preference between
methods must be given to structured keys, based on the ease of update and
the extra links to related information.
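
The structured-key idea can be illustrated with a small sketch. The encoding below is an assumption made only for illustration: each key is a string with one digit (1 to 9) per hierarchy level, so a node has at most nine sons. The actual key layout of Figure III-1a is not reproduced here, and all function names are hypothetical.

```python
def son_keys(key, degree):
    """Derive the keys of all sons from a node's key and degree."""
    return [key + str(d) for d in range(1, degree + 1)]

def parent_key(key):
    """Key of the parent node, or None for a top level profile."""
    return key[:-1] or None

def level_order(keys):
    """Right-justify (zero-pad on the left) and sort: nodes grouped by level."""
    width = max(map(len, keys))
    return sorted(keys, key=lambda k: k.rjust(width, "0"))

def subtree_order(keys):
    """Left-justify (zero-pad on the right) and sort: each subtree contiguous."""
    width = max(map(len, keys))
    return sorted(keys, key=lambda k: k.ljust(width, "0"))

keys = ["21", "1", "112", "12", "2", "111", "11"]
print(son_keys("12", 3))    # ['121', '122', '123']
print(parent_key("121"))    # 12
print(level_order(keys))    # ['1', '2', '11', '12', '21', '111', '112']
print(subtree_order(keys))  # ['1', '11', '111', '112', '12', '2', '21']
```

Because the sons' keys follow from the parent's key and degree alone, all of them can be generated before any record is read, which is what permits the advanced buffering described above.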



C. Profile Definition 

One of the assumptions of cluster searching is that profiles for
relevant clusters are more similar to a query than other profiles. A
similar assumption connected with classification maintains that a high
document-profile similarity implies a large degree of similarity between
the document and all current cluster elements. Consequently, a good
profile definition is crucial to the success of a clustered file.
Basically, a profile attempts to characterize all the documents in its
crown, that is, all documents reachable from it by descending paths.
Given a specific node, a profile is loosely described as a collection
of index terms found in those documents. Consider a node whose crown
is the document set $C = \{D_1, D_2, \ldots, D_n\}$. The following standard profile
definitions are of general interest:

1) Profile $P_1$

Let $D_i = (d_{i1}, d_{i2}, \ldots, d_{iv})$ be a vector of unweighted index
terms representing the i-th document. In particular, let
$d_{ij} = 1$ if term j is assigned to $D_i$ and $d_{ij} = 0$ otherwise,
for j = 1, 2, ..., v. Then

$$P_1 = (p_{11}, p_{12}, \ldots, p_{1v}) = D_1 \vee D_2 \vee \cdots \vee D_n \qquad \text{(III-1)}$$

simply denotes the terms in all clustered items. In other
words $p_{1j} = 1$ if and only if there is at least one document
containing term j; no importance weighting is used.

2) Profile $P_2$

Using the same document description, the vector

$$P_2 = (p_{21}, p_{22}, \ldots, p_{2v}) = D_1 + D_2 + \cdots + D_n \qquad \text{(III-2)}$$

is a weighted profile in which $p_{2j}$ is the number of clustered
items in which term j appears. In this manner, $P_2$ exhibits
document weighting.

3) Profile $P_3$

Let $d_{ij}$ be the importance (weight) assigned to the j-th term
in $D_i$. The profile

$$P_3 = (p_{31}, p_{32}, \ldots, p_{3v}) = D_1 + D_2 + \cdots + D_n \qquad \text{(III-3)}$$

exhibits term weighting in the sense that $p_{3j}$ is the total
importance assigned to term j in all clustered items.
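
The three standard definitions can be computed directly from the document vectors of a crown. The sketch below assumes dense list vectors of equal length; the function names and the sample weights are hypothetical.

```python
def profile_p1(binary_docs):
    """P1: logical OR of unweighted vectors (term present in any document)."""
    return [int(any(col)) for col in zip(*binary_docs)]

def profile_p2(binary_docs):
    """P2: sum of unweighted vectors (number of documents containing the term)."""
    return [sum(col) for col in zip(*binary_docs)]

def profile_p3(weighted_docs):
    """P3: sum of weighted vectors (total importance of the term in the crown)."""
    return [sum(col) for col in zip(*weighted_docs)]

# Hypothetical crown of three documents over four terms; weights are term frequencies.
weighted = [[2, 0, 1, 0], [1, 3, 0, 0], [4, 0, 0, 0]]
binary = [[int(w > 0) for w in doc] for doc in weighted]

print(profile_p1(binary))    # [1, 1, 1, 0]
print(profile_p2(binary))    # [3, 1, 1, 0]
print(profile_p3(weighted))  # [7, 3, 1, 0]
```

The first term shows the difference among the definitions: it is merely present in $P_1$, counted in three documents by $P_2$, and given total weight 7 by $P_3$.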

The definitions begin with two different types of document keywords,
weighted and unweighted, and produce either weighted or unweighted
profiles. Many retrieval systems using manual indexing do not use weights
because keywords are carefully chosen and weighting appears superfluous.
In the case of simple automatic indexing, information for computing
meaningful weights may not be available. In either instance, $P_1$ profiles
carry out the philosophy of no weighting and produce economical vectors
in terms of storage requirements. Although $P_1$ is unsophisticated, it should
not be surprising to find that it is adequate for distinguishing among
groups of cohesive documents (clusters). Litofsky uses a
modified form of this profile in his work (7).

The $P_2$ profile starts with unweighted documents and introduces
weights to emphasize terms which cause cluster formation. This
definition is suitable when converting the unweighted documents of an existing
retrieval system to a clustered file organization. Hill (4) and Dattola
(9) both use forms of the $P_2$ vector; in the latter case, rank value
weighting is used also.

Automatic indexing generally converts every major word to an index
term. As a result, many index terms are assigned to a document and it is
useful to associate a weight with each term to indicate its importance.
In the collections used in this study, term frequencies within a document
are automatically assigned as weights. This idea is carried over to
$P_3$ profiles where a weight is the total term frequency in C. The SMART
system (10) and the work of Rubin and Friedman (6) make extensive use of
$P_3$ profiles. At this point, it is possible to argue convincingly in
favor of each profile definition based on cost-performance, emphasis on
inter-document or intra-document properties, or sensitivity to cluster size.
Later investigations consider each definition: its weighting procedures,
term deletion, and update characteristics.

In a more theoretical treatment, each document is considered as a
point in v-dimensional space. An excellent profile would be the center
of mass of the clustered vectors, namely

$$P_{cm} = \frac{1}{n} \sum_{i=1}^{n} D_i \qquad \text{(III-4)}$$

If the values of the vector matching function are independent of the factor
$1/n$, then $P_2$ and $P_3$ profiles yield the same results as the center of mass
for their respective systems. This condition holds for the cosine
correlation mentioned earlier, COS(Q,P). However, for the cosine, the
vector

$$P_{cv} = \frac{1}{n} \sum_{i=1}^{n} \frac{D_i}{|D_i|} \qquad \text{(III-5)}$$

is a superior profile in the sense that it maximizes the average
correlation between itself and all elements of C (11). The difference between
$P_{cm}$ and $P_{cv}$ is a matter of a slight shift in emphasis on the terms
contributing to the total correlation as evidenced by the following equations.

$$\mathrm{COS}(Q, P_{cm}) = \frac{1}{n\,|P_{cm}|} \sum_{i=1}^{n} |D_i|\ \mathrm{COS}(Q, D_i)$$

$$\mathrm{COS}(Q, P_{cv}) = \frac{1}{n\,|P_{cv}|} \sum_{i=1}^{n} \mathrm{COS}(Q, D_i) \qquad \text{(III-6)}$$

In the first case, term multipliers are sensitive to vector magnitudes,
while in the second case, each term has the same multiplier. Actual
differences in values are slight, however, since many documents with
similar properties are generally involved in the sums.
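
A small numerical check can make the relation between these profiles concrete. The sketch below assumes dense list vectors and a cosine matching function; the sample documents and query are hypothetical. It checks that the summed profile and the center of mass give the same cosine, and that the profile built from length-normalized documents has an average correlation with the cluster members at least as high as that of the center of mass.

```python
import math

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def cosine(u, v):
    """Cosine correlation of two non-null vectors."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def center_of_mass(docs):
    """Mean of the document vectors."""
    return [sum(col) / len(docs) for col in zip(*docs)]

def normalized_mean(docs):
    """Mean of the document vectors after scaling each to unit length."""
    units = [[x / norm(d) for x in d] for d in docs]
    return [sum(col) / len(docs) for col in zip(*units)]

def average_correlation(profile, docs):
    """Average cosine between a profile and every document."""
    return sum(cosine(profile, d) for d in docs) / len(docs)

docs = [[3, 0, 4], [1, 2, 2], [0, 6, 8]]
query = [1, 1, 0]
p_sum = [sum(col) for col in zip(*docs)]   # summed profile
p_cm = center_of_mass(docs)
p_cv = normalized_mean(docs)

print(cosine(query, p_sum), cosine(query, p_cm))
print(average_correlation(p_cm, docs), average_correlation(p_cv, docs))
```

The long third document pulls the center of mass toward itself, while the normalized mean gives each document the same influence.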

The following figures show examples of all profile definitions for
node 1 of the hierarchy shown in Figure III-1. Figure III-2 depicts $P_1$
and $P_2$ profiles while Figure III-3 shows $P_3$ vectors. In both cases, the
upper half of the drawings shows profiles as they might actually be stored
on disk. The lower half shows the contributions to the total cosine
correlation made by a single match involving each term. As expected,
the $P_{cm}$ and $P_{cv}$ vectors produce almost identical contributions. It is
also observed that $P_3$ profiles generally have a larger range of weights
than $P_2$ profiles. As a result, terms of low weights contribute much less
to a correlation in $P_3$ vectors than in $P_2$ vectors. An opposite relation
holds in the case of high weighted terms.

As illustrated, $P_2$ and $P_3$ profiles assign weights based on document
or term frequency. Doyle suggests an alternate profile which characterizes
a cluster by a vector of keywords whose weights are rank values (8). A
rank value is the difference between a base value and the rank assigned
to the term if all terms in the vector are ordered by decreasing frequency.
Such a profile is made by the following procedure.

1) Rank all vector terms by frequency; that is, the
most frequent term has rank 1, etc. Terms with
the same frequency share the same rank.

2) To the i-th term, assign the weight $w_i = b - r_i$ where
b is a base value (constant) and $r_i$ is the rank
given to the term in step 1).

The base value is a pre-selected constant chosen large enough to insure
that all $w_i$ are positive in all profiles. Although weights are based on
frequencies here, it is easy to see how the rank value technique could
be applied to almost any weighting scheme.
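
The two-step procedure can be sketched as below. One detail is an assumption: the text says tied terms share a rank but does not say what rank follows a tie, so this sketch uses consecutive (dense) ranks. The function name and the sample frequencies are hypothetical.

```python
def rank_value_weights(freqs, base):
    """Replace term frequencies by rank values: weight = base - rank.

    freqs maps each term to its frequency in the profile. Tied terms share
    a rank; ranks after a tie are assumed to continue consecutively.
    """
    distinct = sorted({f for f in freqs.values() if f > 0}, reverse=True)
    if len(distinct) >= base:
        raise ValueError("base value too small: some weights would not be positive")
    rank = {f: r for r, f in enumerate(distinct, start=1)}
    return {term: base - rank[f] for term, f in freqs.items() if f > 0}

freqs = {"cluster": 7, "profile": 3, "search": 3, "disk": 1}
print(rank_value_weights(freqs, base=5))   # {'cluster': 4, 'profile': 3, 'search': 3, 'disk': 2}
print(rank_value_weights(freqs, base=10))  # {'cluster': 9, 'profile': 8, 'search': 8, 'disk': 7}
```

In this example the frequency range of 1 to 7 shrinks to a weight range of 2 to 4 with base 5, and the ratio of highest to lowest weight falls further with base 10.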

Rank value profiles exhibit two major differences from the vectors
considered earlier. First, weights are derived from frequency ranks
rather than frequency counts, although these two are somewhat related
(12). As a result, the weight range in a typical vector is reduced
considerably. The reduction is greatest when many documents are considered
and the summed frequencies would become quite large. The importance of
a small weight range is the reduction in the range of contribution values
associated with vector match functions. The second difference is that
the constant base value and subtraction process guarantee that all vectors
have the same high weight rather than the same low weight. This also
decreases the range of contribution values; in fact, the higher the base
value, the smaller the range. Figure III-4 shows the application of rank
value weighting to the $P_2$ and $P_3$ profiles considered previously. Two
base values, 5 and 10, are used. Both the use of ranks and the increase
in base value are observed to decrease the range of contribution values
from those in the previous example.

All forms of profiles mentioned here as well as some variations are
the basis for experiments in Chapter V. In particular, $P_1$, $P_2$, and $P_3$
are evaluated under actual search conditions with and without rank value
weighting. Each of the major differences of rank value vectors is
examined as well as several altogether different weighting procedures.
Finally, vector length and internal processing are considered in some
detail. This section simply presents the standard and rank value profile
definitions and provides several examples of their application.

4. Search Strategies 

The search procedure for a clustered document collection can be
described as a series of correlations and expansions. Specifically, a
query is compared with all nodes on the first level of the hierarchy
(correlations) and one or more of the highest scoring nodes are replaced
by their sons (expansion). This cycle is repeated until the document
level is reached and items are ranked for final output. Narrow searches
expand only a few nodes per level and retrieve the most promising
documents rather quickly. They provide an inexpensive search with high
precision and low recall. Broader searches expand several nodes per
level and secure higher recall, but at lower precision. Thus, by
varying the search strategy, it is possible to obtain a desirable
cost-performance tradeoff: users wanting a small amount of relevant data
have quick, economical searches while users wanting more comprehensive
searches pay and wait accordingly.
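
The correlate-and-expand cycle can be sketched as a simple forward search. This is an illustration under stated assumptions, not SMART's search routine: nodes are plain dictionaries, a node with no sons is treated as a document, the matching function is the cosine, and a fixed number of profiles (`width`) is expanded on every level. All names and sample vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine correlation of two equal-length vectors (0.0 if either is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def forward_search(query, top_level, width):
    """Expand the `width` best profiles per level; return documents by rank."""
    frontier, found = list(top_level), []
    while frontier:
        found += [n for n in frontier if not n.get("sons")]        # documents reached
        profiles = [n for n in frontier if n.get("sons")]
        profiles.sort(key=lambda n: cosine(query, n["vector"]), reverse=True)
        frontier = [son for p in profiles[:width] for son in p["sons"]]
    return sorted(found, key=lambda n: cosine(query, n["vector"]), reverse=True)

def doc(name, vector):
    return {"name": name, "vector": vector}

cluster_a = {"name": "A", "vector": [2, 1, 0, 0],
             "sons": [doc("d1", [1, 1, 0, 0]), doc("d2", [1, 0, 0, 0])]}
cluster_b = {"name": "B", "vector": [0, 1, 2, 1],
             "sons": [doc("d3", [0, 0, 1, 1]), doc("d4", [0, 1, 1, 0])]}
query = [1, 1, 0, 0]

print([n["name"] for n in forward_search(query, [cluster_a, cluster_b], width=1)])
print([n["name"] for n in forward_search(query, [cluster_a, cluster_b], width=2)])
```

With `width=1` only cluster A is expanded and two documents are examined; with `width=2` all four documents are examined, including d4, which shares a term with the query but lies in the lower scoring cluster.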

There are a number of variations of this general procedure. Forward
search strategies allow only one opportunity to expand nodes on each
hierarchy level, thereby constantly proceeding toward the documents. Some
systems allow backtracking, that is, restarting the search on an upper
level when an earlier expansion turns out poorly. The SMART system, for
example, maintains a record of all nodes examined and in each instance
expands nodes with the highest correlations. Unless special steps are
taken, backtracking occurs in the natural course of events. A plunging
strategy is a combination of an extremely narrow forward search followed
by backtracking. Initially, only one node on each level is expanded,
until the items in the first document cluster are examined and ranked for
possible output. Next, a test is applied to determine whether the search
should be halted, backed up one or more levels and continued, or
completely restarted. The test might involve statistical operations, showing
documents to an on-line user, or other criteria. In any case, the search
involves narrow plunges to the bottom of the hierarchy. Generally,
strategies which include backtracking are worthwhile only if there is a
loose, flexible criterion for deciding how many nodes are expanded on each
level. For example, it is useless to backtrack if an a priori decision
is made to expand a fixed number of clusters. In this case a forward
search strategy serves just as well.

Finally, there are a number of problems in measuring query-profile
similarity. Consideration must be given to the influence of cluster size,
hierarchy level, profile length, and the number and type of matching index
terms. The experiments in Chapter V deal with these and other factors
since they relate to profile construction.

For the most part, SMART's cluster search scheme is appropriate for
this investigation, since the experiments focus on profile definition and
behavior rather than search strategies. SMART provides many options for
controlling the scope (broad or narrow), strategy (forward or backtrack),
and matching procedure; additional details are given in Chapter IV and
in a paper by Williamson et al. (10).

5. Updating

Additions to a clustered file are made by blending a new document
into the cluster with which it has the highest similarity. The update
process uses a new document as a request and makes a very narrow search
of the file, recording each node that is expanded (update path). The new
item is stored in a data or overflow area (see Chapter II) and logically
linked to the profile hierarchy. If structured keys are used, the new
item's key is simply one more than the key of the last document already
in the cluster; linkage is established by increasing the degree of the
parent (see Section III.3.A). This blending update procedure is adequate
for a time, but it subjects the hierarchy to the changes illustrated in
Figure III-5a. In the case of cluster A, its profile continues to
characterize both the original and new information; that is, it remains at the
center of the cluster. The updated version of cluster B shows polarization
into distinct regions. Finally, cluster C is "enlarged" by the addition
of a document which does not really belong in any existing cluster, but
which fits best in C. In the last two cases, the profiles no longer
accurately represent their respective clusters and to some degree the
entire document classification is no longer a logical division of the
collection. These intuitive arguments simply point out that in all
probability file updating involves two steps:

1) alterations to each profile on the updating path and

2) final incorporation of the new document into a
lowest level cluster.

[Figure III-5. a) Cluster Updating without Profile Modification (panel label: Original Clusters)]

Even if profiles could be altered in an optimal manner, the hierarchy
continues to degenerate. Figure III-5b shows an example in which two
clusters become polarized with addition of new documents. That is, their
profiles move to one end of the cluster to represent the majority of
documents, and leave the other items virtually unretrievable. No profile
can adequately represent these clusters for searches. Furthermore, the
example indicates the original classification is no longer a logical
division of the data. In light of the new documents, three clusters are
preferable to the original two clusters. The cause of these problems is
the use of a static hierarchy structure to represent a dynamic document
collection. Clearly, new items change the character of the data base, and
the hierarchy structure must change as well as the profiles. The solution
to these problems requires some type of re-clustering whenever the file
enlarges enough to cause a significant drop in retrieval performance.

The investigation of the updating process has two aims: 1) to
examine profile alteration procedures to distinguish those which are
effective and 2) to determine how quickly the hierarchy degenerates,
that is, to find how quickly precision and recall decrease with increasing
file size and thereby obtain an idea of when the file must be re-clustered.
The experiments examine three options for altering profiles. The first
completely re-makes the profile by re-weighting and adding new terms, and
writing the new (probably longer) vector into another part of storage.
This process results in an accurate profile, but requires a large amount
of work and fragments whatever organization is used in storing profiles.
The second option simply associates the new document with the lowest level
node and leaves all profiles unchanged. A third option, useful with
weighted profiles, alters only existing profile terms and changes their
weights to reflect the addition of new documents. No new terms are added,
thereby maintaining the original vector length and the storage sequence,
since the new profile exactly over-writes its previous version.
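
The difference between the first and third options can be shown with a short sketch. It assumes a summed-weight profile stored as a mapping from term to weight; the function names and sample data are hypothetical. The second option needs no code, since it leaves the profile untouched.

```python
def remake_profile(profile, new_doc):
    """Option 1: re-weight existing terms and add new ones (vector may grow)."""
    updated = dict(profile)
    for term, weight in new_doc.items():
        updated[term] = updated.get(term, 0) + weight
    return updated

def reweight_in_place(profile, new_doc):
    """Option 3: change only the weights of terms already in the profile."""
    return {term: weight + new_doc.get(term, 0) for term, weight in profile.items()}

profile = {"cluster": 3, "profile": 1}
new_doc = {"cluster": 2, "thesaurus": 4}

print(remake_profile(profile, new_doc))     # {'cluster': 5, 'profile': 1, 'thesaurus': 4}
print(reweight_in_place(profile, new_doc))  # {'cluster': 5, 'profile': 1}
```

The third option keeps the number of terms fixed, which is what allows the new vector to over-write the old one in the same storage location, at the price of ignoring the new term.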

With an increasing number of additions to the file, the quality of
the classification decays and it becomes desirable to re-cluster the data.
Unfortunately, a complete clustering is an expensive procedure and cannot
be undertaken too frequently. A most promising technique is partial
re-clustering, that is, re-classifying only those portions of the hierarchy
that experience significant growth. Under most circumstances many clusters
receive few additions and therefore might survive for considerable time
without re-organization. More volatile subject areas need frequent
revision. In order to implement partial re-clustering, the profile for each
node should include the number of documents in its current crown as well
as the number of documents added since the last classification (update
count). Whenever the ratio of update count to crown size exceeds a
specified threshold, all items beneath the node are re-clustered, the
update count is reset to zero, and a new update cycle begins. The
experiments conducted in Chapter VII investigate parameters applicable to this
procedure.
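
The bookkeeping for partial re-clustering can be sketched as follows. This is a minimal illustration, assuming the crown size already includes the added documents and that the highest qualifying node on the update path is the one re-clustered; the re-clustering itself is not shown, and the `Node` class and `record_addition` function are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Node:
    crown_size: int        # documents currently beneath this node
    update_count: int = 0  # documents added since the last classification

def record_addition(update_path, threshold):
    """Count one new document along the update path (top level first).

    Returns the highest node whose update-count ratio exceeds the threshold,
    after resetting its update count, or None if no node qualifies.
    """
    for node in update_path:
        node.crown_size += 1
        node.update_count += 1
    for node in update_path:
        if node.update_count / node.crown_size > threshold:
            node.update_count = 0  # a new update cycle begins for this node
            return node
    return None

top, low = Node(crown_size=100), Node(crown_size=4)
first = record_addition([top, low], threshold=0.25)   # low: 1/5, no action
second = record_addition([top, low], threshold=0.25)  # low: 2/6, re-cluster
print(first, second)
```

In this example the small, fast-growing cluster crosses the threshold after two additions while the top level node, with a ratio of about 0.02, is left alone.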

The addition of new documents to a clustered file also increases
search time since a new item is stored in an overflow area away from the
rest of its cluster. Consequently, expanding an upper level node may
involve several disk accesses to fetch all the documents in its crown. The
effort can be reduced, however, by re-writing the file periodically so its
physical and logical sequence are the same. Although there is some
expense involved, this procedure consists of moving data only, not
re-structuring the file. For this reason, this problem is not considered in
detail in later chapters.

Another problem in large data bases is the handling of records which
are old and essentially inactive. Nearly all information loses value with
age; in document retrieval systems, this means that fewer users ask for
or accept older documents even if they are relevant. For this reason it
is reasonable to retire older items to archive storage (magnetic tape, for
example). One way to accomplish this is to make periodic scans of the
hierarchy and to remove all documents beyond a certain age. Naturally
profiles must be adjusted to reflect deletions, and partial re-clustering
may be advisable in cases of significant alterations. The retired
documents are maintained as a serial file for retroactive searching. The
undesirable features of this scheme are the need for a complete file scan
and the loss of information in making the serial file (i.e., the cluster
relationships).

Another solution to the problem of aging files is the generation of 
time dependent clusters* To explain, assume that a 2 year period is chosen 
and that all information obtained in that period is grouped into lowest 
level clusters* Let the rest of the hierarchy be made by grouping 
cluster profiles in the usual manner. The result is a collection divided 
by subject and publication period (on the lowest level only). Updating 
occurs only in clusters of the appropriate period* This idea has several 
advantages. First, old information is easily identified and removed for 
storage elsewhere. Second, users limiting queries to recent material are 
able to eliminate search time spent on unwanted documents. Third, the 
structure of retired information is retained since entire clusters and 
their profiles are removed intact. The profiles can still be used as a
directory so the archive file merely becomes an extension of the disk.
Fourth, vocabulary changes might be implemented across publishing periods 
by associating a dictionary with each period. For example, suppose a 
thesaurus is used in the indexing process and when the period ends, the 
thesaurus is reviewed and changed if necessary. Instead of re-indexing 
all previous documents using the new thesaurus, the old thesaurus (or a
list of changes from new to old) is stored along with all other data for
that publishing period. As a result, only retroactive searches covering
that period must have their queries indexed in terms of the old vocabulary.
New documents and requests for more current information are processed 
using only the newer thesaurus and the data for the new publication period. 
Because the test collections are not large enough, it is not possible to
investigate accurately questions relating to aging files and vocabulary
changes. This discussion is included for completeness and to point out
an area in need of further research.

6. Hierarchy Storage 
A. General 

Search speed is one of the primary goals of any file organization. 
For a clustered document collection stored on disk and accessed by the 
indexed sequential method, search speed depends on the management of disk
space and the storage scheme for documents and profiles. The characteristics
of the disk device (Table 1-1 for the IBM 2314) and the amount of
system traffic (20) also influence retrieval speed; however, these factors
are rarely under control of a system designer and, therefore, are not
considered in detail here. In this research, management of disk space
refers to the allocation of areas for index, data, and overflow purposes
within the indexed sequential access method. Lum et al. (21) consider
many of these options including

1) number of indexes and their placement; 

2) placement and blocksize of data; 

3) amount and placement of overflow (cylinder overflow, 
overflows located on the same or separate disk pack) . 

In the experiments related to hierarchy storage, it is assumed that any 
master and cylinder indexes are core resident as well as the track index 
for the current cylinder. Consequently, a search which crosses cylinder 
boundaries requires an extra read to fetch a new track index. Overflow 
space is limited to 10% of the disk capacity, allocated as 2 tracks per
cylinder.

The interesting questions concerning hierarchy storage deal with 
the method and order of record storage. This chapter introduces the concept
of structured keys for identifying and linking nodes and suggests
three record storage sequences (subtree, heir-filial, and level). To
initially build the file or to re-organize it, records are sorted by key
value, blocked, and written onto prime data areas of the disk. If each 
cluster completely occupies an integral number of tracks, there is no 
wasted space and each access obtains the maximum amount of useful data. 

In practice, cluster sizes vary a great deal and it is necessary to trade 
retrieval speed for wasted space. In each tradeoff, there are two opposing 
ideas to consider. The neighboring concept holds that minimal I/O delays 
occur if records in a filial set occupy the minimum number of disk tracks. 
Filial items are those accessed and processed together such as documents 
in the same cluster, sons of the same hierarchy node, and all profiles on 
level 1 (sons of a dummy node on level 0). The splitting concept holds
that the amount of wasted disk space is minimized only by splitting records
and filial sets across tracks to ensure the use of every byte of storage.
Obviously minimizing the search time wastes disk space and vice versa.

The following storage algorithm attempts to maintain a time-space tradeoff 
by using a threshold for wasted disk space. It assumes full track block- 
ing (one physical record per track) and stores items in a specified 
sequence.

B. A Disk Storage Algorithm

The following disk storage scheme is used in the experiments in this
research. As mentioned above, it attempts to maintain a balance between
space waste and search time. It requires that the order of record storage
be known; that is, which record is first, which is second, etc. As
items are stored, the necessary indexes for indexed sequential access are
inserted whenever a cylinder boundary is crossed. The algorithm consists
of three steps repeated for each filial set of records.

1) Let $T$ = track capacity,
       $\theta$ = threshold for wasted track space,
       $E$ = remaining space on the current track, and
       $S$ = total size of all records in the current filial
           set, $F$.

2) Compute $X_1 = \lceil S/T \rceil$ = number of tracks to store $F$ if a new
       track is begun and
       $X_2$ = number of tracks to store $F$ if the present track is
           continued.

3) Store $F$ taking the actions specified below.

The frequency of cases B and C depends heavily on $\theta$; the frequency of case
A is influenced by the size of the average filial set. After storing
the entire file, the important evaluation criteria are the total amount
of wasted space and the number of filial sets requiring an extra access
(case C).
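
The threshold idea can be illustrated with a short sketch. The decision rule below is only an assumed reading of the algorithm (the action table is not reproduced in this text): a filial set continues on the present track when that costs no extra track, the remainder of the track is abandoned when it is no larger than the threshold, and otherwise the set is split and one extra access is accepted. The names `store_filial_sets`, `waste_threshold`, and `split_sets` are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for non-negative a and positive b."""
    return -(-a // b)


def store_filial_sets(sizes, track_capacity, waste_threshold):
    """Place filial sets on full-track blocks in the given order.

    Returns (tracks_used, wasted_space, split_sets). split_sets counts the
    filial sets that touch one more track than a fresh start would need.
    The decision rule is an assumption made for illustration only.
    """
    remaining = 0      # E: free space left on the current track
    tracks_used = 0
    wasted = 0         # space deliberately skipped (the tail of the last track is not counted)
    split_sets = 0
    for size in sizes:                                  # S for the filial set F
        x_new = ceil_div(size, track_capacity)          # tracks if a new track is begun
        if size <= remaining:
            x_cont = 1                                  # fits on the present track
        else:
            x_cont = 1 + ceil_div(size - remaining, track_capacity)
        if x_cont > x_new:
            if remaining <= waste_threshold:
                wasted += remaining                     # abandon the rest of the track
                remaining = 0
            else:
                split_sets += 1                         # accept one extra access
        if size <= remaining:
            remaining -= size
        else:
            overflow = size - remaining
            new_tracks = ceil_div(overflow, track_capacity)
            tracks_used += new_tracks
            remaining = new_tracks * track_capacity - overflow
    return tracks_used, wasted, split_sets
```

With a track capacity of 100 and filial sets of sizes 60, 50, 95, and 30, a low threshold of 10 packs the file into 3 tracks at the price of two split sets, while a threshold of 50 uses 4 tracks, skips 95 units of space, and splits nothing.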

C. Orders for Hierarchy Storage

The final storage consideration is the order in which records are
placed on disk. Of the three sequences mentioned earlier (subtree, level, 
and heir-filial) and illustrated in Figure III-1, only two are really
viable. Subtree order is easily discarded because it locates filial 
records in widely separated disk areas. Both the level and heir-filial 
sequences place filial records in close proximity. Order by level stores 
all nodes on a complete hierarchy level as a contiguous set of physical 
records. This scheme is most advantageous for broad searches since many 
nodes are expanded and their proximity reduces the motion of the access 
arm. As a bonus, order by level allows a hierarchical arrangement of 
memory devices. For example, the more active, upper level nodes might be 
allocated to a drum while the less active, lower level nodes reside on
moveable head disk. Heir-filial order provides for rapid narrow searches
since the sons of a node are generally nearer their parent than with 
order by levels. However, I/O time for broad searches is high since 
unrelated profiles on the same level are separated by a moderate distance. 
Consequently, if several unrelated nodes on the same level are expanded, 
the disk arm jockeys back and forth among these locations for the rest of 
the search.

If search strategies are of the forward type, then choice of the
optimal storage sequence appears to hinge upon the frequency of narrow
and broad searches. This assumes, however, the existence of realistic
strategies which take advantage of any economies offered by either scheme.
The experiments in Chapter VII consider the applicability of the level
and heir-filial sequences for forward search strategies. The storage
algorithm described earlier is used to place records in a simulated file
prior to actual request processing. An additional outcome of the tests
is an estimate of the I/O activity involved in a cluster search and its
relation to precision-recall performance.

7. Query Clustering 

To this point, the classification procedure has been applied only 
to document vectors. However, any type of data can be clustered, and 
query clustering has been suggested as an alternate way of partitioning
a document collection (13, 14). Implementation of query clustering re- 
quires saving requests and possibly a record of their relevant documents. 
Using this data, the query vectors are clustered just as though they were 
documents; therefore the profiles in the resulting hierarchy are combina- 
tions of query terms. Second, documents are associated with the lowest 
level query clusters using one or more of the following schemes. 

a) All documents relevant to one or more queries of a 
query cluster are associated with that cluster.

b) All documents highly similar to a query cluster profile 
join the corresponding cluster.

c) All documents highly similar to one or more queries of
any query cluster join that cluster.
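
Scheme (a) reduces to a union of relevance lists. The following is a minimal sketch, assuming that query clusters and relevance judgments are held in plain dictionaries; the function names and data layout are hypothetical and are not part of SMART.

```python
def documents_for_query_clusters(query_clusters, relevant_docs):
    """Scheme (a): a document joins every query cluster containing at least
    one query for which it was judged relevant.

    query_clusters: dict mapping cluster id -> list of query ids.
    relevant_docs:  dict mapping query id -> list of relevant document ids.
    """
    assignment = {}
    for cluster_id, query_ids in query_clusters.items():
        docs = set()
        for query_id in query_ids:
            docs |= set(relevant_docs.get(query_id, ()))
        assignment[cluster_id] = docs
    return assignment


def unassigned(all_docs, assignment):
    """Documents attached to no query cluster; these fall back to the
    standard document clustering."""
    covered = set().union(*assignment.values())
    return set(all_docs) - covered
```

A document relevant to queries in two different query clusters is placed in both, so this scheme can introduce overlap among the lowest level clusters.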

Documents not meeting any of these criteria are clustered in the standard 
way. The search process using a query generated hierarchy is no differ- 
ent than with document clusters. The success of query clustering depends 
on the fact that new requests are more likely to be similar to previous
requests than to documents. In view of the large difference between 
speaking and writing vocabularies, it appears plausible that users request- 
ing similar information may phrase their questions similarly so that the 
success of one user can be passed onto another. 

Further, by accumulating requests over long periods of time and by 
constantly placing relevant documents in the proper cluster, it may be 
possible to assimilate vocabulary changes in the system on a gradual basis. 
Early tests with query clustering have shown the technique to be promising, 
but the results are too few to be very conclusive. 

Finally, there is the idea of combining query and document clusters in
the same system. Keen (15) describes a search procedure which tries query 
clusters first and then uses document clusters if the initial search fails 
or leaves the user unsatisfied. However, there is no real reason to main- 
tain two distinct hierarchies. The nodes of both trees could be combined 
and searched at the same time. The combined system might be more effective 
than either one used separately.

8. Alternate Uses of the Hierarchy 

The expense of various classification procedures is indicated in
Section III.2. Whatever its cost, clustering is justified only if

1) the time saved in searches over the hierarchy lifetime
compensates for the clustering expense or

2) special services or advantages are apparent.

In the second instance, costs are spread among several applications or are
justified by increased user satisfaction. This section discusses supplementary
uses of a cluster hierarchy and introduces a set of experiments
on automatic query alteration. These applications could be developed
using different methods, but having a clustered
collection makes them more effective or more economical than might be
possible otherwise.

A. Suggested Uses 

Because of the classification process, the hierarchy provides a 
partition of the collection which is far from random. The aim, in fact, 
is to bias clusters in favor of retrieving relevant information. To some
extent, this requires grouping documents with similar subject content.
Most of the alternate hierarchy uses (browsing, query expansion and
alteration, SDI, and content analysis) benefit from this type of division
also.

Viewed from its top, the hierarchy first separates literature into 
general and then more specific areas on successive levels. This is a 
natural structure for allowing on-line users to browse among nodes, seeking
new areas of interest and discovering new relationships among familiar
topics. Depending on the mode of operation, profile terms might be displayed
along with substitutes, related terms, indicators of importance,
pointers to related nodes, etc. Such information might be used to directly
access documents or to refine the keywords used in the current query
formulation.

A number of systems employ retrieval thesauri to broaden requests 
by replacing or supplementing keywords with class descriptors, thereby 
providing more opportunities for keyword matches. Current research is 
aimed at automatic thesaurus construction in lieu of the manual methods 
used previously (16, 17). In many respects, thesaurus construction is
simply the process of finding highly co-occurring terms and making them
into a single entity. Since clustered documents share a common vocabulary,
the cluster is a natural starting place for identifying initial term
classes (18). In this way the most obvious and important local associations
are recognized first and tested later for their global applicability.
Some of the problems caused by high frequency terms are lessened by this 
procedure. 

Syntactic and linguistic analyses also benefit from the fact that a 
cluster contains homogeneous subject matter. In a sense, the information 
within a cluster provides a context for interpreting meaning, assigning 
parts of speech, resolving ambiguities, etc. Since current techniques 
seem to work best in narrow subject areas, again it is appropriate to 
apply them after documents are placed in individual clusters. 

Procedures for selective dissemination of information (SDI) might 
also benefit from a clustered collection. Generally these systems store 
user interest profiles and compare them with all incoming documents. When 
a profile matches a document, notification is sent to the user so that 
he is made aware of the document's existence. With a clustered file, a
user's name is associated with one or more nodes of the hierarchy. He
receives a notification each time one of his nodes appears on the update 
path of new documents. An advantage of this procedure is that new items 
are not compared with all user profiles. In addition, names can be 
associated with upper and lower level nodes, thereby indicating general 
or specific interests. Finally, only a slight amount of extra work is 
required to provide SDI since all the comparisons must be made anyway in 
order to update the data base. To identify which nodes are associated 
with a user, his original interest profile is used as a query and a file 
search is made to identify nodes which correlate highly with it. After 
adding his name to lists for the indicated nodes, the profile can be 
discarded and SDI proceeds automatically. 
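
The notification step amounts to a lookup along the update path. The sketch below is illustrative only; the representation of the hierarchy (node identifiers as strings, subscriber lists as a dictionary) and the name `notify_users` are assumptions.

```python
def notify_users(update_path, subscribers):
    """Return the users to notify when one new document is added.

    update_path: node ids visited while inserting the document, from the
                 top of the hierarchy down to its cluster.
    subscribers: dict mapping node id -> set of user names attached to it.
    """
    notified = set()
    for node in update_path:
        notified |= subscribers.get(node, set())
    return notified
```

No user profile is compared with the document here; the only work beyond the normal update is one dictionary lookup per node on the path.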

B. Query Alteration 

Term classifications have been used by a number of researchers in 
attempts to improve performance by adding related index terms to documents 
and requests (16, 17, 19, 22). In some of this work, statistical methods are
employed to determine the best substitute for each vocabulary term. Then 
the base terms in individual vectors are augmented with substitute terms 
which are used in a wide variety of ways during matching. For example, 
substitutes might be employed in either requests or documents, or both. 

Or matches involving substitutes might be counted only if there are 
matches on their corresponding bases. Or matches might have different 
emphasis according to whether they involve base or substitute terms. With 
the proper options, an impressive degree of success has been achieved 
with these techniques in unclustered collections.

One difficulty in applying these procedures to large collections is
the amount of computation for obtaining the base-substitute pairs, for
example, preparing a similarity matrix. In this research, a set of substitutes
is associated with select profile terms for each node in the
hierarchy. Specifically, base-substitute pairs are profile terms which 
have a maximum term-term correlation in those documents beneath the node 
under consideration. For the lowest level nodes, only a few terms and 
documents are involved and similarity matrices are easily calculated and 
stored. On upper levels, the required matrices are obtained by combining 
those on lower levels. These techniques greatly decrease the amount of 
effort in obtaining substitutes, especially when minor profile terms are 
removed from consideration.

A further bonus from a substitute structure within the hierarchy is 
its close association with the document collection. The broad substi- 
tutes on upper levels are applicable to the entire collection while those 
on lower levels are concerned with local associations. Each set of 
substitutes is applied only as query searches enter the appropriate portion 
of the hierarchy. On any particular level, the substitutes brought to 
bear may be those from parent nodes (broader terms which expand the 
search) or those in the current profiles (narrower and more discriminating 
terms). In either case, the request expansion is temporary in the sense 
that a new substitute set is applied as the search descends the hierarchy. 

In addition, there are two general ways of relating base and substitute 
terms during the vector matching process (TIED and UNTIED options). TIED
substitutes have a conditional presence in the expanded vector in that they
enter the correlation process only when there is a match on the corresponding
base. In a sense, their matches are "tied" to base matches. With the
UNTIED option, bases and substitutes are treated independently and both
are always available for matching, much as in traditional query expansion.
Finally, different emphasis may be assigned to matches on base terms and 
those on substitute terms. 
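
The difference between the two options can be shown with a small matching routine. This is a sketch under simplifying assumptions: terms are unweighted, each base has at most one substitute, and the score is the number of matching terms with an optional reduced emphasis (`sub_weight`, a hypothetical parameter) on substitute matches.

```python
def match_score(query, target, substitutes, tied=True, sub_weight=1.0):
    """Count term matches between a query and a profile or document.

    query, target: iterables of terms.
    substitutes:   dict mapping a base term to its substitute term.
    tied:          if True, a substitute match counts only when its base
                   term also matches the target (TIED option); if False,
                   substitutes match independently (UNTIED option).
    """
    query, target = set(query), set(target)
    base_hits = query & target
    sub_hits = set()
    for base in query:
        sub = substitutes.get(base)
        if sub is None or sub in query or sub not in target:
            continue                      # nothing new to match
        if tied and base not in target:
            continue                      # TIED: no base match, so ignore the substitute
        sub_hits.add(sub)
    return len(base_hits) + sub_weight * len(sub_hits)
```

For a query (a, b, d) with substitutes f for b and g for d, a target (a, f, g) scores 1 under the TIED option and 3 under the UNTIED option, since neither base b nor base d is present in the target.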

Figure III-6 illustrates these concepts on a small structure,
showing the profiles, base-substitute pairs, and the use of substitutes
in searching. Not all profiles contain the same terms, so naturally the
substitute sets differ for each node. The pair (e,a) is always present;
other pairs occur only once or twice. Some terms have the same substitute
on both levels while others change substitutes. In the example, the
scoring function is simply the number of matching terms. Without the
use of substitutes, the request matches profiles P' and P'' equally.
Using substitutes from the upper level, the query expands to (a,b,d,g,f)
and the matching favors node P''. Note that in this case, the tied
option neglects the match on term g since there is no match on its base d.
Two versions of the query are formed when substitutes on the same level
are used. Again node P'' is favored because of the extra match on term a
due to its close association with term f.

The experiments in Chapter VIII restrict themselves to the TIED
matching option. The tests are further divided into those for increasing
recall by using substitutes from previous levels (parents) and those for
increasing precision by using the substitutes for profiles on the current
level. Several sets of substitutes are made with varying degrees of
frequency restrictions applied to the terms. In all cases the purpose of
the experiments is to show additional uses for the profile hierarchy and
to help justify the expense of constructing a clustered file.

9. Summary 

This chapter presents a general, but comprehensive review of the
construction, use, and maintenance of a clustered file. Clustering methods
are classified as generative or divisive and according to the amount of
work they involve. Dattola's algorithm is explained in some detail since
it is used extensively in the research. The description of hierarchy
formation includes cluster size and overlap, linkage among nodes, and
profile definition. In particular, the standard profiles and rank value
profiles are defined, illustrated, and compared. Next, various search
strategies are explained followed by some ideas on how the hierarchy
should be stored in order to achieve a fast search. In discussing updating,
it becomes apparent that file maintenance involves changes to profiles
as well as periodic partial reclustering as the collection grows. The
possibility of query clustering is introduced as another way of achieving
a viable document classification. The comments on browsing procedures,
thesaurus construction, SDI, etc. relate alternate ways of using clusters
once they are formed. Of particular interest is automatic query alteration
based on term-term associations found in each node's profile.

Chapter IV



The Experimental Environment 

1. Introduction to the SMART System

The principal tool used in this research is the SMART information 
retrieval system at Cornell University (1, 4). By virtue of its modular
design and extensive facilities for gathering evaluation statistics,
SMART is more than a simple document retrieval or text processing system.
In reality, it provides a laboratory environment for testing the effectiveness
of content analysis methods, search strategies, file organizations,
and on-line procedures. The system makes available several static document
collections with corresponding query sets and a wide variety of
processing methods and controlling parameters. Experiments are generally
carried out by changing a single variable and making pairwise comparisons
between retrieval runs. Some types of experiments suffer because there is
no actual user population. However, the advantages of reproducible results
and a completely controlled environment more than compensate for
this.


A great deal of the SMART evaluation is based on retrieving user-specified
relevant documents. For the most part, the author of each
request has carefully examined the collection and identified items which
answer his question. These relevance judgments are recorded and used as
a standard for measuring the quality of output from experimental tests.
Typically, the system correlates a request vector with all or part of a
document collection and ranks items in order of decreasing similarity,
that is, in the order they are to be retrieved. Using the relevance
judgments, the output is scored according to how many relevant items are found,
at what rank positions, etc. Averaging performance measures across an 
entire query set provides at least one basis for comparing retrieval 
methods. There are a number of problems with this evaluation, but its 
strongest point is that the measures of retrieval quality are based on 
documents that requestors previously designated relevant and not some 
internal criterion. 
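
The scoring of a ranked output against relevance judgments can be sketched as follows. Precision and recall at a fixed cutoff are used here as representative measures; SMART's own evaluation routines compute further statistics, and the function names below are hypothetical.

```python
def precision_recall_at(ranked_docs, relevant, cutoff):
    """Precision and recall after retrieving the top `cutoff` documents.

    ranked_docs: document ids in order of decreasing similarity.
    relevant:    set of document ids the requestor judged relevant.
    """
    retrieved = ranked_docs[:cutoff]
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def relevant_ranks(ranked_docs, relevant):
    """Rank positions (1-based) at which relevant documents appear."""
    return [rank for rank, doc in enumerate(ranked_docs, start=1) if doc in relevant]
```

Averaging such per-query values over the whole query set gives one basis for comparing two retrieval runs.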

In the usual case, SMART automatically extracts the information 
content of a natural language document or query and represents it as a 
vector of weighted concepts. A concept is simply a numerical identifier 
for a word, word stem, or phrase that actually occurs in the text. The 
weight reflects the semantic importance of the concept and generally 
increases in proportion to the frequency of occurrence. Figure IV-1a is
a schematic of a vector of this type. The stored representation of a 
document is a condensed version of this vector containing only concepts 
with non-zero weights and augmented by header information and a small
piece of retrieval data (Figure IV-1b).

Queries, documents, and profiles all have similar internal formats.
Any of these items could be considered a vector in an m-space, having
one dimension for each vocabulary term. Given two such vectors, a
document and a query, a number of functions might be used to measure
their proximity in space and hence the desirability of retrieving the
document. Experiments with SMART show that the cosine of the vector
angle is a rather good measure for detecting relevance, and this function
is used extensively in this research (see Section II.2.C and Figure IV-1c).
With the cosine similarity function, the retrieved items lie within an
m-dimensional cone swept about the query vector. The half-angle of the
cone is $\cos^{-1}(K)$, where $K$ is the cutoff established to distinguish
retrieved and non-retrieved documents.
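
A minimal sketch of the cosine correlation and the cutoff follows, assuming that the condensed vectors are represented as dictionaries from concept number to non-zero weight. The representation and the names `cosine` and `retrieve` are illustrative and do not describe SMART's internal format.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dict: concept -> weight)."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, documents, cutoff):
    """Rank documents by decreasing cosine and keep those at or above the
    cutoff K, i.e. those inside the cone of half-angle arccos(K)."""
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in documents.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored if score >= cutoff]
```

Raising the cutoff narrows the cone and retrieves fewer documents; lowering it widens the cone.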

2. The Data Collections 

The collection used in this study consists of the abstracts for 
1400 aerodynamics articles and their corresponding 225 requests with
relevance judgments. Texts for both documents and queries are those used 
by Cleverdon's Aslib Cranfield Project (2). The SMART word stem analysis 
procedure was applied to the text in order to obtain search vectors. 

Specifically, each item is indexed by 

1) deleting words found on a restriction list, 

2) reducing morphological forms of the same word to their 
common stem, and 

3) weighting each stem according to its number of 
occurrences. 

The restriction list consists of approximately 360 prepositions, pronouns,
conjunctions, and auxiliary and common verbs (see Appendix A). The automatic
stemming procedure conflates words ending with common suffixes while
taking into account doubled final consonants, changes of y to i, and the
removal of silent e's. This processing results in a vocabulary of 5000
words with a rank-frequency distribution similar to that of the well-known
Zipf curve.
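
The three indexing steps can be sketched as below. The stop list and the suffix stripper are tiny stand-ins chosen for illustration; they are not SMART's restriction list or stemming rules, and the handling of doubled consonants, y-to-i changes, and silent e's is omitted.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the restriction list and suffix list.
STOP_WORDS = {"the", "of", "and", "in", "a", "is", "are", "on", "for", "to"}
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_text(text):
    """Return a stem -> weight mapping: delete restricted words, reduce
    each remaining word to a stem, and weight stems by occurrence count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [toy_stem(token) for token in tokens if token not in STOP_WORDS]
    return Counter(stems)
```

For example, the text "The flow of heated gases and the flows in heating" yields the stems flow and heat with weight 2 each and gas with weight 1.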

For later reference, the distribution of document and query lengths 
is shown in Figure IV-2. In this context the length of a vector refers to
its number of non-zero elements; that is, the number of index terms
assigned to the corresponding text. Understandably, documents are
longer than queries, averaging 54 terms as opposed to 9 for queries. A
number of documents have in excess of 100 terms. The distribution of
weights is just as important as length. Within documents, term frequencies
range from 1 to 27 occurrences. A great many terms occur once, a
smaller number twice, and so on; the distribution appears to be almost
Poisson. Queries differ markedly in that 97% of all terms have unit
frequencies within their vectors and the other 3% occur only twice. As
it is, query terms might as well not be weighted according to frequency
at all. These results indicate that users write short specific requests
and omit background material that might really be helpful. This may be a
genuine user preference or because instructions are not given to the
contrary.

A final statistic to report is the distribution of relevant documents
for each query (Figure IV-3). With an average of 7 relevant items
per question, the collection generality is 1.2. Using all the averages
mentioned so far, the typical query contains 9 terms of the same weight 
and aims at retrieving 7 relevant documents indexed with 54 concepts 
apiece. Obviously the task is a hard one, for even if all query terms are 
matched a document might have a cosine correlation of only 0.41. Under 
these conditions, random keyword matches are expected to have a noticeable 
influence on performance. Longer request formulations would help this 
situation, but unfortunately users do not seem to supply them on their 
own accord. 
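
The 0.41 figure can be checked by hand. Assume, purely for illustration, that every term in the 9-term query and in the 54-term document has unit weight (actual document weights vary) and that all 9 query terms match. The cosine correlation is then:

```latex
\cos(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}
           = \frac{9}{\sqrt{9}\,\sqrt{54}}
           = \sqrt{\tfrac{9}{54}} \approx 0.41
```

Under the same assumption, a query of 18 matching terms would reach about 0.58, which is why longer request formulations would help.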

Distribution of relevant documents
Cranfield Collection, 1400 Documents, 225 Requests

Figure IV-3

3. The Generated Hierarchies

Three cluster hierarchies were produced for the Cranfield documents
using the Dattola classification procedure (3). All consist of three
levels (two for profiles and one for documents), but their clusters differ
greatly in size and overlap. Hierarchy 1 is the primary test case for
this study; the others are used to confirm selected test results. Table
IV-1 lists the properties of each hierarchy. In order to interpret this
data, a few definitions will be reviewed. The crown of a node is the
number of documents reachable from it along all descendant paths. For
a particular query, a node is relevant if and only if at least one relevant
document is included in its crown. The values reported in the table
are the average number of relevant nodes per level. Finally, overlap is
defined as the ratio of the total number of leaves to the collection size
(minus 1.0 and expressed as a percent).
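
These definitions translate directly into code. The sketch assumes the hierarchy is given as a dictionary from each node to its list of sons, with documents as the leaves; the names `crown` and `overlap_percent` are illustrative.

```python
def crown(node, children):
    """Set of documents reachable from `node` along all descendant paths.
    The crown size used in the text is the length of this set."""
    sons = children.get(node)
    if not sons:
        return {node}                 # a leaf is a document
    docs = set()
    for son in sons:
        docs |= crown(son, children)
    return docs


def overlap_percent(total_leaves, collection_size):
    """Ratio of total leaves to collection size, minus 1.0, as a percent."""
    return (total_leaves / collection_size - 1.0) * 100.0
```

Applied to the leaf counts of the three hierarchies (1500, 2679, and 1400 leaves for 1400 documents), `overlap_percent` gives roughly 7, 91, and 0 percent.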

The first two hierarchies are designed so that each document cluster
fits onto a single disk track (approximately); the third allows about
two clusters per track. Comparing Hierarchies 1 and 3 in Table IV-1,
both have about the same overlap and node degrees, although the latter
contains nearly twice as many nodes. Hierarchy 2 has much more overlap,
but only a few nodes on level 1 (with high degrees). As a result, its
first level profiles characterize 450 documents (indirectly) rather than
50 or 100 as in the other cases.

                                            Hierarchy
Property                              1         2         3

Level 1 (Profiles)
  Number of nodes                     13        6         28
  Average crown                       115       446       50
  Range of crowns                     60-201    267-445   22-81
  Average relevant nodes              3.9       3.3       5.2
  Average sons                        4         16        4
  Average profile length              812       908       526

Level 2 (Profiles)
  Number of nodes                     55        94        103
  Average crown                       27        28        14
  Range of crowns                     10-55     11-64     4-25
  Average relevant nodes*             5.3       9.0       5.1
  Average P_ profile length           323       311       197
  Average P* profile length           266

Level 3 (Documents)
  Total number of nodes               1500      2679      1400
  Overlap                             7%        91%       0%
  Average relevant nodes (unique)     7         7         7
  Average document length             54        54        54
  Range document length               10-164    10-164    10-164

*Based on partial data

Crown = number of documents reachable from a node
Relevant node = a node whose crown contains one or more relevant
                documents
Overlap = ratio of total nodes on level 3 to collection size
          (1400) less 1.0

Properties of the Experimental Hierarchies

Table IV-1

The number of relevant nodes per level provides a superficial evaluation
of Dattola's classification procedure without regard to searching.
The table shows that level 2 clusters confine relevant documents to a
small number of groups, but the level 1 clusters do not confine them a
great deal more. It is unfortunate that the algorithm is not able to
place all the relevant documents for a query under a single first level node,
especially since a typical query has only 7 relevant documents. In the
present situation, the broader search strategies should expand 3-5 nodes
on level 1 and 5-9 nodes on level 2. In some circumstances this destroys
the economy of involving the first level profiles at all. To illustrate,
suppose that all level 1 profiles occupy two disk tracks while all level
2 profiles occupy 5 tracks. If 3 nodes are expanded on each level, then
the search cost is probably 2 accesses for level 1 and 2 to 4 accesses
for level 2, making a total of 4 to 6 accesses. Under these circumstances,
it might be better to disregard level 1, just examine level 2, and incur
a fixed cost of 5 accesses. Obviously the situation is improved in both
cost and performance if the relevant documents are grouped more tightly.
However, in the experimental environment, collection size constrains the
number of nodes if a reasonable cluster size is maintained. In an actual
collection of thousands of documents, there would be a great many more top
level nodes and the economy becomes more apparent.

Table IV-1 also shows the average vector length (number of index terms)
for some of the standard profile types in each hierarchy. As large as
the vectors are, they do not strictly conform to the definitions of
standard profiles (see Section III.3) since the definitions generate
vectors with more index terms than the software can handle. In order to
perform any experiments, all terms with frequency 1 are removed in each

P or P vector. Even so, on level 1 of Hierarchy 2 terms of frequency 

3 2 

3 or less had to be eliminated. At this point it is impossible to 
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substantiate that this deletion has negligible effect on results; later 
analysis and experiments confirm this assertion. Since profiles are 
based on term occurrence frequencies and Pg profiles use document frequen- 
cies, the initial term deletion results in vectors of different lengths. 
Consequently, two sets of unweighted (P^) vectors are possible, depending 
on whether Pg or P^ vectors are used as starting profiles. For example, 
given vectors, their unweighted counterparts are generated simply by 
setting all weights to a constant value. In an actual system, this storage 
space would be reclaimed. In any case these shortened, but otherwise 
undisturbed vectors are used as "standard" profiles throughout the study. 
The set of unweighted vectors will always be identified and consistent 
with other profiles used for comparison, 
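The shortening and unweighting steps described above can be sketched as
follows. This is only an illustration under stated assumptions: a
profile is represented as a Python dict from index term to frequency
weight, and the function names and sample terms are hypothetical, not
part of SMART.

```python
def drop_low_frequency_terms(profile, min_frequency):
    """Keep only the terms whose frequency weight is at least min_frequency.

    Removing "all terms with frequency 1" corresponds to min_frequency=2.
    """
    return {term: w for term, w in profile.items() if w >= min_frequency}


def unweight(profile, constant=1):
    """Produce the unweighted counterpart by setting every weight to a constant."""
    return {term: constant for term in profile}


# Hypothetical weighted profile (term -> frequency).
weighted = {"wing": 7, "flow": 3, "shock": 1, "plate": 2}

shortened = drop_low_frequency_terms(weighted, min_frequency=2)
unweighted = unweight(shortened)
```

Because the deletion is applied before unweighting, the unweighted
vector inherits the length of whichever weighted vector it starts from,
which is why two sets of unweighted vectors arise.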

4. Evaluation

This research is a study of profile definitions, uses, and
modifications, as well as an evaluation of clustered files in general.
However, evaluation is complicated by the following three factors:

1) changes in profile definitions may affect several
evaluation measures (storage, search time, quality
of search output),

2) output quality varies with the search strategy, and

3) imperfect evaluation measures.

The first factor points out the difficulty of selecting a "good" profile
since it is improbable that a single definition maximizes all the desired
measures. Output quality refers to the amount of relevant material
retrieved and the order of its presentation. These quantities depend on
the number of profiles expanded on each hierarchy level (search strategy)
which has a secondary effect on response time. Consequently, a "good"
profile for one search strategy may not perform well for another strategy.
Some of these difficulties are eased by the dual evaluation procedure
adopted here. Both methods are based on external relevance judgments for
the query set. In the first case, the standard SMART precision-recall
computations are made for fixed search strategies. The second, new
method is based on the content of clusters chosen for expansion rather
than the final document order. Although the search is not completed,
the method is independent of search strategy. Finally, in nearly all
cases, comparisons are made among runs with only one changed parameter.
Given these precautions, the general consistency of both evaluation
methods, and the relatively large query set, reasonable confidence is
placed in the conclusions drawn from the output.

A. SMART Evaluation

A SMART cluster search strategy includes selections for parameters
concerned with

1) measuring query-profile similarities and

2) deciding which nodes are expanded (1).

The cosine correlation is used exclusively in this work although it has
been suggested that query-profile similarities should be influenced also by
the position of a node in the hierarchy. Some of the results in Chapter
V deal with this question. In order to decide which nodes should be
expanded, SMART maintains a list of nodes examined on all previous levels
and arranges it in order of decreasing similarity. At each expansion
point, the list is scanned from its beginning and nodes are expanded
until one of the following criteria is met:

1) the cumulative crown of the expanded nodes exceeds a 
preset maximum, 

2) the number of expanded nodes falls within a chosen 
range, 

3) correlations fall beneath a threshold, or 

4) the end of the list is reached. 

Several additional factors may influence the criteria also. In any case
the sons of the nodes expanded are fetched, matched with the query, and
their correlations sorted back into the list to await further processing.
In the test situation here, two fixed search strategies are used for
evaluation. The first is a narrow search designed to examine 5% of
the documents and expand approximately 1 node per level. The second is
a broader search looking at 10% of the collection and expanding 1 to 2
nodes per level. The complete set of SMART parameters is given in Table
IV-2.
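A simplified sketch of one expansion point is given below. It is not the
SMART implementation: the way the criteria are combined, the tuple
layout of the candidate list, and the sample correlations are all
assumptions made for illustration, and the correlation margin (EPS) is
left out. The parameter names follow Table IV-2.

```python
def select_nodes_to_expand(candidates, wanted, minnod, maxnod, mincor):
    """Scan a ranked candidate list and return the node ids to expand.

    candidates: list of (node_id, correlation, crown_size), already sorted
                by decreasing correlation.
    Assumed stopping rules: never exceed maxnod nodes; once minnod nodes
    are expanded, stop when the cumulative crown has reached `wanted` or
    the next correlation is below `mincor`; otherwise stop at list end.
    """
    expanded = []
    cumulative_crown = 0
    for node_id, correlation, crown_size in candidates:
        if len(expanded) >= maxnod:
            break
        if len(expanded) >= minnod and (
            cumulative_crown >= wanted or correlation < mincor
        ):
            break
        expanded.append(node_id)
        cumulative_crown += crown_size
    return expanded


# Hypothetical ranked list of level 1 nodes for one query.
ranked = [("n1", 0.40, 90), ("n2", 0.35, 60), ("n3", 0.02, 120)]

narrow = select_nodes_to_expand(ranked, wanted=70, minnod=1, maxnod=1, mincor=0.05)
broad = select_nodes_to_expand(ranked, wanted=140, minnod=1, maxnod=3, mincor=0.01)
```

With these made-up values the narrow strategy expands one node and the
broad strategy two, which matches the intent of the two parameter sets
but says nothing about actual SMART behavior.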

Given a search strategy, SMART examines the most promising profiles 
and produces a list of documents ranked in order of decreasing similarity. 
Suppose that the first k documents on the list are retrieved. Then for 
each query, the document collection is divided into four exclusive sets 
(Figure IV-4a):

1) retrieved and relevant (a documents)

2) retrieved and non-relevant (b documents)

3) non-retrieved and relevant (c documents)

4) non-retrieved and non-relevant (d documents)
                                          Narrow Search*   Broad Search*

Maximum cumulative crown (WANTED)               70              140
Minimum nodes expanded (MINNOD)                  1                1
Maximum nodes expanded (MAXNOD)                  1                3
Correlation margin about expansion
cutoff (EPS)                                   .005             .005
Minimum correlation threshold (MINCOR)         .05              .01

*All other parameters are 0.

SMART Cluster Search Parameters

Table IV-2
                     Relevant        Non-relevant
                     Documents       Documents

Retrieved
Documents                a                b

Non-retrieved
Documents                c                d

k = a+b = number of documents retrieved

a) Subdivision of a Document Collection After
Retrieval of k Items

b) Sample Precision-Recall Plot for a Cluster Search

Figure IV-4
Note that the cutoff for retrieval is k = a+b. Under these conditions
the search precision and recall are defined as:

$$\text{Precision} = \frac{a}{a+b} \tag{IV-1}$$

$$\text{Recall} = \frac{a}{a+c} \tag{IV-2}$$

SMART evaluation consists of plotting averaged precision-recall data for
a range of cutoffs as well as computing four global measures. For this
research, P-R points (precision-recall) are calculated for cutoffs of
k = 5, 10, 15, 20, 30, 50, 75 (the broad search includes k = 100). All
possible cutoffs are specifically not used in order to avoid placing undue
emphasis on the initial P-R values. From a user's view it makes little
difference whether a relevant document is ranked first or third since a
half dozen items are probably judged anyway. However, differences in
early rank positions have considerable influence on P-R curves. By
plotting points at spaced intervals, this bias is lessened. A sample
curve is shown in Figure IV-4b. From equations IV-1 and IV-2 it is seen
that recall-precision values of 1.0 are perfect so that the better the
performance, the higher the curve.
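As an illustration of computing P-R points at a cutoff, the sketch below
counts the a, b, and c sets for the first k documents of a ranked list.
The document identifiers and relevance judgments are invented for the
example, and the function name is hypothetical.

```python
def precision_recall_at(ranked_docs, relevant, k):
    """Return (precision, recall) when the first k ranked documents are retrieved."""
    retrieved = ranked_docs[:k]
    a = sum(1 for doc in retrieved if doc in relevant)  # retrieved and relevant
    b = len(retrieved) - a                              # retrieved, non-relevant
    c = len(relevant) - a                               # relevant, not retrieved
    precision = a / (a + b) if retrieved else 0.0
    recall = a / (a + c) if relevant else 0.0
    return precision, recall


# Hypothetical ranked output and relevance judgments for one query.
ranking = ["d3", "d7", "d1", "d9", "d4", "d2", "d8", "d5", "d6", "d0"]
relevant_docs = {"d3", "d1", "d2", "d6"}

points = {k: precision_recall_at(ranking, relevant_docs, k) for k in (5, 10)}
```

Evaluating only at spaced cutoffs, as in the text, means a change in the
order of the first few documents alters at most the earliest points.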

The global measures calculated are normalized precision, normalized 
recall, rank recall, and log precision (5). The two normalized values 
approximate average standard precision and average standard recall, but 
actually measure the difference between the ideal performance (precision 
and recall of 1.0) and actual performance. Rank recall and log precision 
are somewhat simplified forms of the normalized measures. All of these 
values are strongly influenced by initial data points (ranks 1, 2, 3, ...)
and are not easily corrected for this factor. However, they are useful
for condensing the graphical output into a few statistics for easier 
evaluation. 

Figure IV-5 contains the precision-recall curve and global measures 
for a cluster search using Hierarchy 1, profiles, and both search 
strategies described earlier. Corresponding points from a P-R curve for 
a full search are included as a basis of comparison for this and future 
searches. Considering only the relative positions of the curves and 
neglecting the amount of system effort (user cost) involved, it is easy 
to conclude cluster searching is vastly inferior to other methods. However,
cluster searches are not intended to have the complete effectiveness of a
full search. Their usefulness comes from flexibility; in an inexpensive,
narrow search a few relevant items can be obtained, and as the search
broadens greater recall is achieved. A very broad search might completely
neglect the upper levels of a hierarchy and, in the limit, become a full
search. Search cost has been treated casually thus far whereas it is an
important factor in comparing the results of cluster search strategies.

As mentioned earlier, the number of disk accesses per query search is
regarded as a reasonable cost measure, at least being proportional to
both computing charges and on-line response time. Because SMART handles
many queries in parallel and actually simulates cluster searches, it is
impossible to obtain the number of disk accesses per query. The average
number of correlations per level is known, however, although this data is
difficult to relate to accesses without knowing the details of record
storage. Even if an accurate measure of system effort were available some
Symbol   Description               NR     NP    C(1)   C(2)   C(3)

  o      Full Search               .88    .61     0      0    1400
  △      Narrow Cluster Search     .62    .35    13      8      90
  □      Broad Cluster Search      .6?    .42    13     12     160

Legend: NR - normalized recall
        NP - normalized precision
        C(X) - number of correlations on level X

Sample Precision-Recall Curve for Hierarchy 1

Figure IV-5
subjective judgments are required to settle tradeoffs between cost and
performance. All this simply shows that search evaluation is not a
straightforward task. Fortunately, the tests ahead compare similar runs
with approximately the same number of correlations per level. This
greatly simplifies the job of rating various profile definitions, etc.

The SMART evaluation procedure has a number of drawbacks of a mechanical
and aesthetic nature. First, averaging performance over a query set
must provide for different numbers of relevant documents for each query
(interpolation, etc.) (6). Second, because a cluster search does not
examine all documents, it may not have the opportunity to retrieve all
relevant information. This condition results in an artificial upper
bound on recall, or a recall ceiling. In order to reduce the influence of
this factor, P-R curves are not displayed beyond the retrieval cutoff. The
global measures require ranks for all relevant items, however, and for
this reason curves are extrapolated by assuming that unretrieved relevant
documents would occupy random positions in the remaining output (7).
Third, precision and recall measure user satisfaction without regard to
search cost. Attempts have been made to incorporate cost by changing the
extrapolation procedure, but for the most part, system effort is recorded
by the number of correlations per hierarchy level and these figures are
simply associated with a P-R curve. Fourth, the evaluation process is
quite dependent on search strategy. As a result, conclusions stated under
one set of conditions may or may not apply to a different set of parameters.
Finally, the cost of SMART search and evaluation procedures precludes
examining a large number of search strategies.
In spite of these drawbacks, evaluation by means of document level
precision-recall curves is useful in that it gives complete information
about a specific type of search and that curves of this type have become
the standard measures in document retrieval systems.

B. Cluster Oriented Evaluation

This section develops a method for evaluating profiles based on
their success in differentiating relevant and non-relevant nodes. The
measures used, recall ceiling and precision floor, are analogous to
document recall and precision, but extended to clusters. They account for
overlap and place different values on clusters due to their size or the
amount of relevant information they contain. From one point of view, the
evaluation considers the retrieval of clusters of information rather than
single documents. Accordingly, statistics are computed only after each
cluster is "retrieved".

Consider a hierarchy level with m nodes (Figure I-2). The broadest
possible search strategy, examining all m nodes, involves all the
distinctions to be made among these nodes under any condition. For query i,
assume the nodes are ranked by decreasing correlation so that they would
be expanded in this sequence. For the node ranked j = 1, 2, ..., m, let
$c_{ij}$ be the number of documents in its crown that are not present in
the crowns of nodes with higher ranks. Similarly let $r_{ij}$ be the
number of relevant documents in its crown that have not been recovered
previously. Note that $c_{ij}$ and $r_{ij}$ compensate for overlap as it
is encountered in the sequence of ranked nodes. The sums
$\sum_{j=1}^{k} c_{ij}$ and $\sum_{j=1}^{k} r_{ij}$ represent the
cumulative crown and cumulative number of relevant over k nodes. The
quantity

$$R_i = \sum_{j=1}^{m} r_{ij} \tag{IV-3}$$

is the total number of documents relevant to the query. Recall ceiling
is the percent of all relevant that are recoverable subject to the
expansion cutoff j:

$$RC_{ij} = \frac{r_{i1} + r_{i2} + \cdots + r_{ij}}{R_i} \tag{IV-4}$$
This is the highest possible document recall that is attainable for this
search strategy. Obviously large values of $RC_{ij}$ are preferred and
$RC_{i1} = 1.0$ is the optimal situation: all relevant documents beneath
the first node. Deceptively high recall ceilings could be obtained by
placing nearly all documents in a single large cluster and dividing the
remainder into several small clusters. For most requests, the large
cluster is expanded first and a high average recall ceiling is obtained.
Viewing only this measure, it appears that good performance is obtained by
examining one cluster. This is only partially true, of course. The
clusters used here do not have skewed size distributions, but slight
effects of size are observable. Precision floor corrects for size bias by
measuring the percent of recoverable documents that are relevant, subject
to the expansion cutoff j:

$$PF_{ij} = \frac{r_{i1} + r_{i2} + \cdots + r_{ij}}{c_{i1} + c_{i2} + \cdots + c_{ij}} \tag{IV-5}$$

Precision floor represents the lowest possible document precision if all
documents beneath the first j nodes are retrieved. Again, large values
of $PF_{ij}$ are preferred; $PF_{i1} = 1.0$ indicates the ideal situation:
all relevant and no non-relevant beneath the first node.



As an example of these measures, consider a hierarchy level having 
5 nodes with non-overlapping crowns of sizes 10, 10, 20, 20, and 30 
documents respectively. Suppose that a query with 8 relevant documents 
is correlated with the profiles and produces the ranking and evaluation 
statistics shown in Figure IV-6. The values are interpreted as follows. 

If only one node is expanded, regardless of what else occurs in the search, 
the highest possible document recall is 4/8 and the lowest possible
document precision is 4/20 (assuming all are retrieved). Similar statements
can be made for other cutoffs. The accompanying graphs show the changes
in $RC_{ij}$ and $PF_{ij}$ for varying expansion cutoffs.
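The two measures can be computed for one query with a short routine like
the one below, offered as a sketch rather than the procedure actually
used. Crowns are given as sets of document ids in rank order, so overlap
is handled by counting only documents not seen under higher ranked
nodes. Only the first node (20 documents, 4 of them relevant) follows
the example in the text; the remaining ranking and the placement of the
other relevant documents are hypothetical, since the figure data are not
reproduced here.

```python
def ceiling_and_floor(ranked_crowns, relevant):
    """Return [(recall ceiling, precision floor)] for cutoffs j = 1..m.

    ranked_crowns: list of sets of document ids, in decreasing order of
                   query-profile correlation.
    relevant:      set of document ids relevant to the query.
    """
    total_relevant = len(relevant)
    seen = set()
    cum_docs = 0  # running sum of the "additional documents" counts
    cum_rel = 0   # running sum of the "additional relevant" counts
    results = []
    for crown in ranked_crowns:
        new_docs = set(crown) - seen  # documents not under a higher rank
        seen |= new_docs
        cum_docs += len(new_docs)
        cum_rel += len(new_docs & relevant)
        rc = cum_rel / total_relevant
        pf = cum_rel / cum_docs if cum_docs else 0.0
        results.append((rc, pf))
    return results


# Crowns of sizes 20, 10, 30, 10, 20 in a hypothetical rank order.
crowns = [set(range(0, 20)), set(range(20, 30)), set(range(30, 60)),
          set(range(60, 70)), set(range(70, 90))]
relevant_docs = {0, 1, 2, 3, 20, 21, 30, 31}  # 8 relevant documents

stats = ceiling_and_floor(crowns, relevant_docs)
```

For the first cutoff this gives a recall ceiling of 4/8 and a precision
floor of 4/20, in agreement with the interpretation given above.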

After processing n queries, the average recall ceiling and average
precision floor are computed as follows:

$$RC_j = \frac{1}{n} \sum_{i=1}^{n} RC_{ij} \tag{IV-6}$$

$$PF_j = \frac{1}{n} \sum_{i=1}^{n} PF_{ij}, \qquad j = 1, 2, \ldots, m \tag{IV-7}$$



These are actually macro averages in that they average the individual
performance statistics for each query of the set (6). Other averaging
methods could be defined also, but are not used here. Both $RC_j$ and
$PF_j$ can be plotted separately or together along with some measure of
system work to provide a performance curve for one level of the hierarchy.
Figure IV-7a is a hypothetical plot of $RC_j$ versus the number of clusters
expanded. All such curves are non-decreasing and achieve a maximum of
$RC_j = 1.00$ for j = m. In the ideal situation every query has all its
a) Example of cluster oriented evaluation statistics

b) Graphs of recall ceiling and precision floor for the above
query.

Legend

query i: 8 relevant documents
$c_{ij}$ = number of additional documents in the cluster ranked j
$r_{ij}$ = number of additional relevant documents in the cluster ranked j
$RC_{ij}$ = recall ceiling if j nodes are expanded
$PF_{ij}$ = precision floor if j nodes are expanded

Figure IV-6
a) Hypothetical Plot of Recall Ceiling Versus Expansion Cutoff
(horizontal axis: Expansion Cutoff; curves: O Experimental, □ Ideal)

b) Hypothetical Plot of Recall Ceiling Versus Precision Floor
Plotted at Various Expansion Cutoffs
(axes: Precision Floor and Recall Ceiling, each marked from .2 to 1.0;
curves: O Experimental, □ Ideal)

Figure IV-7
relevant in the crown of the top ranked node so $RC_j = 1.00$ for j =
1, 2, ..., m. Hence, the aim is to raise experimental curves to a
horizontal line. Figure IV-7b is a hypothetical plot of $PF_j$ versus $RC_j$
drawn at various expansion cutoffs. With this curve, the ideal case arises
if for every query, only relevant documents are found in the crown of the
top ranked node. Hence $RC_1 = PF_1 = 1.00$, and the ideal graph is a
vertical line at the extreme right of the scale.
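The macro averages of equations IV-6 and IV-7 amount to averaging, rank
by rank, the per-query values. A minimal sketch follows; the per-query
numbers are invented and the function name is hypothetical.

```python
def macro_average(per_query_values):
    """Average a per-query statistic at each expansion cutoff.

    per_query_values: one list per query, each holding the statistic
                      (recall ceiling or precision floor) for cutoffs 1..m.
    """
    n = len(per_query_values)
    m = len(per_query_values[0])
    return [sum(query[j] for query in per_query_values) / n for j in range(m)]


# Hypothetical recall ceilings for two queries on a level with three nodes.
recall_ceilings = [
    [0.5, 0.75, 1.0],  # relevant documents spread over several nodes
    [1.0, 1.0, 1.0],   # ideal query: everything under the top ranked node
]

average_rc = macro_average(recall_ceilings)
```

Each query contributes equally to the average regardless of how many
relevant documents it has, which is what distinguishes a macro average
from pooling all relevant documents together.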

Measuring system work (search time and cost) is a non-trivial task,
involving the selection of a work unit and problems associated with
measurement. The number of disk accesses per search is probably the best
unit of work. However, measurements made under this condition depend on the
characteristics of a specific storage device, order of hierarchy storage,
blocking factors, etc. As a result, disk accesses are too specific a
unit except where these factors are controlled. The SMART evaluation
suggests measuring work by the number of query comparisons made with
documents and profiles. The inaccuracies here are twofold. First, size
differences among data vectors on various levels indicate that not all
comparisons incur the same cost. Second, this unit ignores the economy
from storing vectors in adjacent locations. The cluster evaluation scheme
measures system effort by the number of nodes expanded in the search. This
quantity is device independent and emphasizes the economy from storing
items adjacently. A fixed number of accesses (one or two) might be
associated with each node expanded in order to provide conversion to other
units. However, the varying number of sons per node is neglected and
obviously, more work is involved in expanding a node with many sons than
with a few sons. In spite of this final difficulty, the number of nodes
expanded is used as the measure of system effort in the majority of cases
in this study. Basically, it is assumed that averaging over many nodes
and many queries minimizes the effects from variations in the degrees of
nodes.



Given two profile definitions $P_A$ and $P_B$, there must be some agreed
method of using the recall ceiling and precision floor measures and curves
to determine which definition is superior. The following rule is used for
this purpose: $P_A$ is said to be superior to $P_B$ if the average values of
recall ceiling and precision floor for $P_A$ are greater than the
corresponding values for $P_B$. Symbolically this is expressed as:

$$P_A > P_B \quad \text{if} \quad (RC_j)_A > (RC_j)_B \ \text{ and } \ (PF_j)_A > (PF_j)_B \tag{IV-8}$$

for j = 1, 2, ..., m.

Note that the values of j may be restricted to the initial ranks since
$RC_m = 1.0$ in all cases.
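The comparison rule can be expressed as a small predicate, sketched
below under the assumption that the averaged curves are held as lists
indexed by expansion cutoff. The optional restriction to initial ranks
reflects the remark that the last recall ceiling is always 1.0. All
numbers are invented for illustration.

```python
def is_superior(rc_a, pf_a, rc_b, pf_b, max_rank=None):
    """True if definition A beats definition B at every compared cutoff.

    rc_*, pf_*: average recall ceiling and precision floor per cutoff.
    max_rank:   compare only the first max_rank cutoffs (all if None).
    """
    limit = len(rc_a) if max_rank is None else max_rank
    return all(
        rc_a[j] > rc_b[j] and pf_a[j] > pf_b[j] for j in range(limit)
    )


# Hypothetical averaged curves for two profile definitions on one level.
rc_A, pf_A = [0.6, 0.8, 1.0], [0.30, 0.25, 0.10]
rc_B, pf_B = [0.5, 0.7, 1.0], [0.20, 0.15, 0.10]

a_wins_initial_ranks = is_superior(rc_A, pf_A, rc_B, pf_B, max_rank=2)
a_wins_all_ranks = is_superior(rc_A, pf_A, rc_B, pf_B)
```

Under a strict inequality the full comparison fails at the final cutoff,
where both curves coincide, so restricting the ranks is what makes the
rule usable in this sketch.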

So far, evaluation considers a single level of the hierarchy and all
of its nodes. However, an actual search generally accesses only part of
the nodes on any one level. Still, it is reasonable to use all nodes in
evaluation since this includes the full set of items to be distinguished
under any search strategy. Examining multiple subsets of nodes is possible,
but turns the evaluation into an undesirable combinatorial test situation.
This is really unnecessary since it can be shown that if $P_A > P_B$ holds
for all nodes on a level, then it also holds for a majority of subsets of
these nodes. For any particular subset S, either

1) $P_A > P_B$ within S or
2) $P_B \ge P_A$ within S.

However, since $P_A > P_B$ for the entire level, the first case must be
more prevalent among all possible subsets. As a result, $P_A > P_B$ for
the majority of search strategies.

The remaining concern is the interaction among all levels of the 
hierarchy. The evaluation considers each level independent of the rest. 

Given two profile definitions $P_A$ and $P_B$, is it possible to have a
contradiction such as $P_A > P_B$ on one level and $P_B > P_A$ on another
level? Although possible, this situation is highly improbable if the
profile definition is at all reasonable, that is, assigning term weights
which are non-decreasing with the number of keyword occurrences. Further,
since profiles on higher tree levels are composites of those on lower
levels, it is even more difficult to realise the contradiction if a
profile definition is consistently applied. Lastly, if a contradiction
occurs it is fairly clear that neither profile has a strong superiority
over the other. Both probably perform about the same.

As mentioned earlier, the ideal case arises when $RC_1 = PF_1 = 1.0$.
This situation occurs if for all queries:

1) the classification isolates all relevant and no non-
relevant in a single cluster and

2) the single relevant cluster always ranks first when
profiles are matched with the query vector.

In practice neither goal is achieved; therefore the best achievable
performance curve lies beneath the ideal curve. Specifically, Table IV-1
shows that Dattola's classification algorithm resulted in 3 to 5 relevant
nodes on level 1 and 5 to 9 relevant nodes on level 2. Under these
conditions, best performance occurs if for each query i, the profiles
are always ranked in a way that maximizes all partial sums

$$\sum_{j=1}^{k} r_{ij}, \qquad k = 1, 2, \ldots, m.$$

This ranking is best in the sense that the greatest number of relevant
documents are retrieved for the fewest expanded nodes (least amount of
work). Figures IV-8 and IV-9 show the best achievable performance for
the first and second levels of the hierarchies used in this study.
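One way to picture the best achievable ranking is sketched below. It
assumes non-overlapping crowns, in which case ordering nodes by
decreasing number of relevant documents maximizes every partial sum;
with overlapping crowns the counts depend on the order and this simple
sort is no longer guaranteed to be optimal. The per-node counts are
hypothetical.

```python
def best_achievable_recall_ceiling(relevant_per_node):
    """Recall ceiling per cutoff under the best possible node ranking.

    relevant_per_node: number of relevant documents in each node's crown
                       (crowns assumed disjoint, at least one relevant).
    """
    total = sum(relevant_per_node)
    ceilings = []
    running = 0
    # Expanding the richest nodes first maximizes each partial sum.
    for count in sorted(relevant_per_node, reverse=True):
        running += count
        ceilings.append(running / total)
    return ceilings


# Hypothetical query: 8 relevant documents spread over 3 of 5 nodes.
best_curve = best_achievable_recall_ceiling([0, 2, 4, 0, 2])
```

The resulting curve is an upper bound for any profile definition on that
classification, since no ranking of the same nodes can recover relevant
documents faster.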

Although cluster-oriented evaluation faces the drawbacks of incomplete
searching and unsure relations between hierarchy levels, the method
has several advantages. First, it is independent of expansion cutoffs
and some parameters of various search strategies. Second, it accounts
more accurately for system effort (hence, search cost) and leaves user
effort as a secondary consideration. Third, it examines only a small part
of the retrieval process rather than attempting to measure effects across
an entire search. Presumably, the latter technique obscures some
experimental effects in its across-the-board measurements. Finally, cluster
evaluation is quite economical compared to actual searches. Both
evaluation methods are used where appropriate so that conclusions are drawn
with a substantial degree of confidence.
Legend

Hierarchy 1, 13 nodes
Hierarchy 2, 6 nodes
Hierarchy 3, 28 nodes

Best Achievable Performance Curves, Level 1

Figure IV-8

Legend

O Hierarchy 1, 55 nodes
△ Hierarchy 2, 96 nodes

Best Achievable Performance Curves, Level 2
(axis label: Precision Floor)

Figure IV-9
Chapter V

Profile Experiments

1. Introduction

The previous chapters provide the background for the experiments in
this and succeeding chapters. Previous chapters contain discussions on
the structure and use of a clustered file, the basic profile definitions,
search methods, storage organization, updating techniques, and other
areas. The experimental environment, description of the collections and
hierarchies, and the evaluation methods were also covered.

The present chapter presents the results of an extensive set of
experiments focused on profile definition. Particular attention is given
to:

a) the performance of the standard profiles ($P_1$, $P_2$, $P_3$);

b) the effects of rank value weighting;

c) bias in search results;

d) profile length;

e) frequency considerations; and

f) tradeoffs among unweighted, partially weighted, or
fully weighted terms.

For the most part, the work uses Hierarchy 1 and cluster-oriented
evaluation (see Chapter IV). The most promising techniques are thoroughly
tested using all hierarchies and both evaluation schemes. Final
conclusions are based on the complete set of results.

A summary of the major conclusions includes the following. 

a) Profiles superior to standard or rank value vectors can
be made by using term weights based on frequency ranks
(not rank values). The resulting vectors are free from
correlation domination and other biases.

b) A large portion of the low weighted profile terms may be
deleted without a large performance loss. In fact,
deletion improves the performance of unweighted profiles.

c) Unweighted profiles give somewhat inferior search
performance, but partial weighting schemes may suffice
instead of fully weighted profiles.

A number of secondary conclusions related to cluster size, biased search 
results, and frequency considerations are brought forth also. 

2. Standard Profile Performance 

In order to review the standard profile definitions, consider a
node whose crown is the document set $C = \{D_1, D_2, \ldots, D_n\}$.
Then, the standard profiles are

$P_1 = D_1 \lor D_2 \lor \cdots \lor D_n$ where $D_i$ is an unweighted vector,

$P_2 = D_1 + D_2 + \cdots + D_n$ where $D_i$ is an unweighted vector, and

$P_3 = D_1 + D_2 + \cdots + D_n$ where $D_i$ is a weighted vector.

Terms in $P_1$ profiles are unweighted while those in $P_2$ and $P_3$
profiles are weighted according to document frequencies or total
occurrence frequencies within C. In Chapter IV it was explained how each
profile is obtained from the clustered Cranfield collection including the
necessity of eliminating the lowest frequency terms. In order to describe
other profile properties, the following concepts are introduced. The size
of a profile P is the number of documents in its crown; its length is the
number of index terms in its vector; and its magnitude |P| is the square
root of the sum of squares of its term weights. In the case of $P_1$
vectors, each term is assigned a unit weight. The properties of the
standard profiles for Hierarchy 1 are given in Table V-1.
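The three standard profiles and the length and magnitude properties can
be illustrated with a few lines of code. This is a sketch under the
assumption that each document is a dict from index term to its
within-document frequency; the function names and the two sample
documents are hypothetical.

```python
import math
from collections import Counter


def standard_profiles(documents):
    """Build the unweighted, document-frequency, and term-frequency profiles.

    documents: list of dicts mapping index term -> within-document frequency.
    Returns (p1, p2, p3) as dicts mapping term -> weight.
    """
    p2 = Counter()  # sum of unweighted vectors: document frequency
    p3 = Counter()  # sum of weighted vectors: total occurrence frequency
    for doc in documents:
        for term, freq in doc.items():
            p2[term] += 1
            p3[term] += freq
    p1 = {term: 1 for term in p2}  # logical OR: unit weight per term present
    return p1, dict(p2), dict(p3)


def length(profile):
    """Number of index terms in the profile vector."""
    return len(profile)


def magnitude(profile):
    """Square root of the sum of squares of the term weights."""
    return math.sqrt(sum(w * w for w in profile.values()))


# Hypothetical crown of two documents.
crown = [{"wing": 2, "flow": 1}, {"wing": 1, "shock": 3}]
p1, p2, p3 = standard_profiles(crown)
```

For an unweighted profile the magnitude reduces to the square root of
its length, so magnitude and length carry the same information there.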

Evaluation curves for the standard profiles are shown in Figures
V-1 and V-2. At least three observations can be made. First, weighted
profiles perform significantly better than unweighted ones
($P_2 > P_1$, $P_3 > P_1$). Later examination shows that the results for
unweighted vectors are biased so that small clusters unfairly achieve high
ranks, regardless of their relevancy. Some of this bias can be removed
and performance improves considerably. A second observation is that
term weights based on document frequency appear equivalent, if not
slightly superior, to weights based on total term occurrence (using
within-document frequencies), i.e. $P_2 \ge P_3$. This is a surprising and
pleasing result since it indicates that an existing document collection
without weights can be clustered and searched without performance loss due
to profiles. If a large performance difference had been observed, an
unweighted collection would have to be re-indexed with weights in order to
obtain maximum benefit from the clustered organization. The final
observation is that of a slight performance advantage for the shorter $P_1$
vectors over the longer ones. The effects of vector length in unweighted
profiles are discussed in Section V.7. Actually the standard profiles
differ in so many ways that it is impossible to draw conclusions from these
tests. The curves are presented as a reference for later experiments
involving fewer variables.
Property                            Nodes on      Nodes on
                                    Level 1       Level 2

Number of profiles                     13            55
Average size                          115            28

$P_3$ Profiles
(term frequency weighting)
  Average length                      812           323
  Range of lengths                 438-1302       120-692
  Average magnitude                   590           162
  Range of magnitudes               340-971        74-304

$P_2$ Profiles
(document frequency weighting)
  Average length                      722           266
  Range of lengths                 397-1175        93-580
  Average magnitude                   291            84
  Range of magnitudes               169-477        37-145

$P_1$ Profiles
(unweighted, made from $P_3$)
  Average length                      812           323
  Range of lengths                 438-1302       120-692
  Average magnitude                    28            44
  Range of magnitudes                22-36         11-26

$P_1$ Profiles
(unweighted, made from $P_2$)
  Average length                      722           266
  Range of lengths                 397-1175        93-580
  Average magnitude                    27            16
  Range of magnitudes                20-33         10-24

Properties of $P_1$, $P_2$, $P_3$ Profiles for Hierarchy 1

Table V-1
[Figure: evaluation curves for the four standard profile types: $P_3$
profiles, $P_2$ profiles, $P_1$ profiles made from $P_3$, and $P_1$
profiles made from $P_2$]

Evaluation of the Standard Profile Definitions
Hierarchy 1, Level 1

Figure V-1
[Figure: evaluation curves plotted against recall ceiling]

Legend

$P_3$ Profiles (term frequency weighting)
$P_2$ Profiles (document frequency weighting)
$P_1$ Profiles (unweighted, made from $P_3$)
$P_1$ Profiles (unweighted, made from $P_2$)

Evaluation of the Standard Profile Definitions
Hierarchy 1, Level 2

Figure V-2
Because many curves similar to Figures V-1 and V-2 are presented, a
few remarks about their characteristics are in order. First, points are
plotted at cluster cutoffs in all cases so that the i-th point represents
the precision floor and recall ceiling obtained if the search expands i
clusters. Roughly speaking, the same amount of system work (e.g., number
of disk fetches) can be associated with the first, second, third, etc.,
points on all curves for a given hierarchy level. Second, the PF scale
varies considerably between levels while the RC scale is the same. This
is due to the substantial difference in the size of the profiles on
various levels and the dependency of PF on profile size. Third, more
performance differences are generally observed on upper hierarchy levels.
This is caused by the nature of these vectors: longer, more extreme
weights, greater magnitudes, etc.

3. Rank Value Profiles

Rank value profiles derive term weights from frequency ranks rather
than frequency counts. Given a P2 or P3 vector, its terms are ordered
by decreasing frequency and re-weighted by assigning them rank values. A
rank value is the difference between a base value and the position of the
term in the frequency ranking. Chapter III illustrates rank value
profiles and points out their differences from standard profiles; namely

a) all vectors have the same high weight rather than the
same low weight and

b) the range of term weights is considerably reduced
since weights are derived from frequency ranks.
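
The re-weighting just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rank_value_profile`, the sample frequency counts, and the use of dense ranks (terms with equal frequency share a rank, and ranks run without gaps) are not taken from the original system.

```python
def rank_value_profile(freqs, base):
    """Re-weight a frequency vector with rank values (base minus frequency rank).

    Terms are ranked by decreasing frequency, equal frequencies share a rank,
    and absent terms (frequency 0) keep weight 0.
    """
    distinct = sorted({f for f in freqs if f > 0}, reverse=True)
    rank = {f: i + 1 for i, f in enumerate(distinct)}  # rank 1 = most frequent
    if base <= len(distinct):
        raise ValueError("base value must exceed the largest rank")
    return [base - rank[f] if f > 0 else 0 for f in freqs]


counts = [1, 0, 2, 0, 0, 5, 3, 0, 0, 2]     # illustrative frequency counts
print(rank_value_profile(counts, base=5))    # highest weight 4, lowest 1
print(rank_value_profile(counts, base=21))   # highest weight 20, lowest 17
```

With a common base value every profile gets the same highest weight (base minus one), while the lowest weight depends on how many distinct frequencies the profile contains.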

The following experiments examine the selection of a base value and
these differences. The results indicate superior performance can be
achieved through rank value weighting. However, improvements are due to
changes in the physical properties of the vectors rather than to factors
intrinsic to the rank values themselves.

A. Base Value Selection 

Suppose a rank value profile P has k index terms with frequency ranks
from 1 to r ≤ k (terms with the same frequency share the same rank). If
the base value is b > r, term weights range from w_o = b - r to
w_a = b - 1. The quantity w_o is the weight origin (lowest value), w_a is
the weight apex (highest value), and w_a - w_o = r - 1 is the weight
range. As mentioned above, keeping the base value constant for all
profiles assures that all vectors have the same apex. This contrasts with
the standard profiles which all have the same origin.

In Doyle's work (1), the major criterion in base value selection is
assurance of a positive weight origin in all profiles. However, the
base value influences cosine correlations and search results and
therefore should be chosen carefully. To illustrate, consider a rank value
profile P = (p_1, p_2, ..., p_v) with a base value b such that w_o = 1.
Increasing the base value to b' = b + a (a > 0) is equivalent to
increasing all term weights by a constant and forming

$$P' = P + A, \qquad A = (a_1, a_2, \ldots, a_v), \qquad a_i = \begin{cases} a = b' - b & \text{if } p_i \neq 0 \\ 0 & \text{if } p_i = 0 \end{cases} \tag{V-1}$$

The addition vector A is actually an unweighted profile whose unit weight
is a. Correlations involving P' can be expressed as follows:

$$\cos(Q,P') = \cos(Q,P+A) = \left[\alpha\cos(Q,P) + \beta\cos(Q,A)\right]\left\{1 - 2\alpha\beta\left[1-\cos(P,A)\right]\right\}^{-1/2} \approx \alpha\cos(Q,P) + \beta\cos(Q,A) \tag{V-2}$$

where

$$\alpha = \frac{|P|}{|P| + |A|}, \qquad \beta = 1 - \alpha = \frac{|A|}{|P| + |A|}$$

This equation indicates that the total correlation is approximated by a
linear combination of two other correlations: one from the original
weighted profile P and the other from the unweighted profile A. Note
that α and β depend only on |P| and a = b' - b. Further, as the base
value increases, w_o → ∞, w_a → ∞, α → 0, and β → 1. As a result, the
unweighted correlation dominates the total and performance approaches that
of unweighted profiles.

The effect of increasing the base value can also be viewed as making
terms less distinguishable during the correlation process. For example,
the terms of the profile P = (2, 1) using b = 3 contribute to correlations
in the ratio 2:1. That is, a match on one term is worth twice as much
as a match on the other. Raising the base value to b' = 11 yields
P' = (10, 9) whose terms contribute in the ratio 10:9. The relative
importance of terms is reduced so that correlations differ only slightly
depending on which term is actually matched. The same effect occurs in
large vectors also, namely an increase in base value "smears" the
importance of the weights assigned to individual terms. In the profiles
for these experiments, term weights represent frequencies; using a low
base value maintains frequency distinctions while a high value decreases
their importance.
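
The behavior of the approximation can be checked numerically. The sketch below is an illustration only: the profile `p` and query `q` are hypothetical, and the definitions of α and β (|P| and |A| each divided by |P| + |A|) are an assumption reconstructed from the cosine identity rather than quoted from the text.

```python
import math

def norm(x):
    return math.sqrt(sum(w * w for w in x))

def cos(x, y):
    return sum(a * b for a, b in zip(x, y)) / (norm(x) * norm(y))

def raised_base_cosine(q, p, a):
    """Compare cos(Q, P + A) with alpha*cos(Q, P) + beta*cos(Q, A).

    A adds the constant a > 0 to every nonzero term of P (equation V-1).
    """
    add = [a if w != 0 else 0 for w in p]
    alpha = norm(p) / (norm(p) + norm(add))   # assumed definition of alpha
    beta = 1.0 - alpha
    approx = alpha * cos(q, p) + beta * cos(q, add)
    exact = cos(q, [w + d for w, d in zip(p, add)])
    return exact, approx, alpha, beta

p = [1, 0, 2, 0, 0, 4, 3, 0, 0, 2]   # hypothetical rank value profile, origin 1
q = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]   # hypothetical query
for a in (1, 10, 100):
    exact, approx, alpha, beta = raised_base_cosine(q, p, a)
    print(f"a={a:3d} exact={exact:.3f} approx={approx:.3f} "
          f"alpha={alpha:.3f} beta={beta:.3f}")
```

As a grows, alpha shrinks toward 0 and the exact correlation moves toward the correlation with the unweighted vector alone.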

Figures V-3 and V-4 compare search performance in Hierarchy 1 for
several sets of rank value profiles made from P2 vectors (document
frequency weighting). The results are generally as predicted by equation
V-2, namely decreasing performance with increasing base value. This
supports the idea of maintaining the distinctions apparent in the original
term weights to whatever extent possible. The single exception to these
conclusions occurs in Figure V-3 for the lowest base value (66). One
explanation for the b = 66 result is that too small a base value places
unwarranted importance on frequency rank as a retrieval indicator.
However, other data shows that the b = 66 performance is strongly
influenced by a single, large cluster which nearly always ranks high
regardless of its relevancy. Some evidence of this situation lies in the
fact that the abscissa (RC) values are nearly identical for both curves
while their ordinate (PF) values differ markedly because of cluster size
(see equation IV-5). The difference is not in the number of relevant
documents retrieved, but in the fact that they are recovered from
clusters of vastly different sizes. The exact nature of this bias is
discussed further in Section V.4.A.

The evidence shows that rank value profiles perform better when they
rely on a small base value rather than a large base value. Since the base
value is selected prior to profile construction, it is difficult to
determine a value which is low, but not so low as to jeopardize
performance as in Figure V-3 (b = 66). Fortunately, once the profiles are
made, they can be adjusted via Equation V-1 to any desired base value.
The experiments in the following sub-section show how to eliminate the
entire problem of base value selection.

[Plot: precision floor versus recall ceiling, one curve per base value]

Base Value     Apex     Lowest Origin
    66          65            2
    86          85           22
   100          99           36
   226         225          162
    ∞       (unweighted)

Search Performance as a Function of Base Value
Rank Value P2 Profiles, Hierarchy 1, Level 1

Figure V-3

[Plot: precision floor versus recall ceiling, one curve per base value]

Base Value     Apex     Lowest Origin
    26          25            2
   100          99           76
    ∞       (unweighted)

Search Performance as a Function of Base Value
Rank Value P2 Profiles, Hierarchy 1, Level 2

Figure V-4



B. Weight Origins and Apexes

Using the same base value throughout a hierarchy or level causes
all profiles to have the same weight apex while their origins vary. For
example if b = 21, the hierarchy might contain these profiles:

P  = (18,20,19)                          w_o = 18, w_a = 20
P' = (20,16,18,15,12,19,16,17,13,14)     w_o = 12, w_a = 20

The previous experiment supports, to some extent, the notion of
maintaining maximum differentiation among profile terms in rank value
profiles by keeping the base value (and hence weight origins) low. A
logical extension of this idea is to artificially reduce the weight
origins for all vectors to the same low value. Note that this does not
produce a standard P2 or P3 vector since profile term weights are still
based on frequency ranks.
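
A minimal sketch of this origin adjustment follows. The function name `fix_origin` and the choice of 1 as the common origin are assumptions made for illustration; the first input vector is the three-term example profile given for b = 21.

```python
def fix_origin(weights, origin=1):
    """Shift the nonzero weights so that the lowest one equals `origin`."""
    lowest = min(w for w in weights if w > 0)
    return [w - lowest + origin if w > 0 else 0 for w in weights]


print(fix_origin([18, 20, 19]))   # [1, 3, 2]
print(fix_origin([20, 16, 18, 15, 12, 19, 16, 17, 13, 14]))
```

After the shift both vectors start at the same origin, while their apexes now differ with the number of distinct ranks in each profile.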

Figures V-5 and V-6 compare the performance of rank value profiles
with variable weight origins (fixed apex) and similar vectors with a
fixed weight origin (variable apexes). The experimental profiles are
constructed from original P2-type vectors (document frequency weighting)
and are designed so that the lowest origin is the same in all tests. The
best previous curves are included also. The figures show no significant
difference between good profiles with variable origins and profiles with
a low fixed origin. This is not unexpected in level 2 vectors, where the
change in origin produces only small changes in term weights. However,
most level 1 profiles have their term weights reduced by a considerable
amount, yielding vectors whose correlations are more sensitive to
individual term weights (see Section V.2.A). As a result, the size
bias noted earlier is removed and performance improves. The importance
of these tests is that they show rank value profiles with a fixed,
minimal weight origin provide equivalent or better performance than
similar profiles constructed using an optimal base value (variable
origins). Consequently, base value selection need not be considered in
profile construction. The new construction process simply sorts the index
terms of an initial vector (P2 or P3) in increasing frequency order and
assigns weights equal to ranks in the sorted sequence. These vectors are
denoted by P2* or P3*, depending on the initial vector.

[Plot: precision floor versus recall ceiling]

Fixed origin       w_o = 2
Variable origin    Base value = 86    Lowest origin = 22
Variable origin    Base value = 66    Lowest origin = 2

Comparison of Fixed and Variable Weight Origins
Rank Value P2 Profiles, Hierarchy 1, Level 1

Figure V-5

[Plot: precision floor versus recall ceiling]

Fixed origin       w_o = 2
Variable origin    Base value = 26    Lowest origin = 2

Comparison of Fixed and Variable Weight Origins
Rank Value P2 Profiles, Hierarchy 1, Level 2

Figure V-6

C. Weight Range

The starting points for rank value profiles are standard vectors
(P2 or P3) whose term weights are frequency counts. Given such rank value
profiles, the previous experiments suggest reducing their weight origins
to a minimal constant in order to improve performance (P2* or P3*
vectors). The difference between these final profiles and the standard
profiles is simply that term weights are ranks of document frequencies
rather than frequency counts. Figures V-7 and V-8 compare the
effectiveness of rank weighting and count weighting for both document and
term frequencies. In all cases P2 and P2* curves are connected by solid
lines while P3 and P3* curves are drawn in a different line style.

[Plot: precision floor versus recall ceiling]

Document frequency counts    P2
Document frequency ranks     P2*
Term frequency counts        P3
Term frequency ranks         P3*

Comparison of Profile Term Weights Based on Frequency
Counts and Frequency Ranks, Hierarchy 1, Level 1

Figure V-7

[Plot: precision floor versus recall ceiling]

Document frequency counts    P2
Document frequency ranks     P2*
Term frequency counts        P3
Term frequency ranks         P3*

Comparison of Profile Term Weights Based on Frequency
Counts and Frequency Ranks, Hierarchy 1, Level 2

Figure V-8

In order to interpret the curves, first suppose that a fixed decision
is made to weight profile terms either according to frequency counts or
ranks. Then the evaluation shows no consensus on a preference for
document or term frequencies, that is, P2 ≈ P3 and P2* ≈ P3*. This result
also supports the findings of Section V.2. In a similar manner, consider a
fixed decision to base all profile term weights either on document or on
term frequencies. In this case, the tests suggest that weights based on
ranks are superior to those based on counts, that is, P2* > P2 and
P3* > P3. The major reasons for this performance improvement are the
altered profile characteristics, specifically a reduced weight range and
its effect on the cosine correlation, as shown below.



The magnitude of a profile vector and the presence or absence of a
few term matches may greatly influence the final cosine correlation. Given
a profile P = (p_1, p_2, ..., p_v), where p_i is the weight of term i,
and a similar query Q = (q_1, q_2, ..., q_v), matching terms with weights
p_i and q_i contribute to the total correlation the amount

$$\frac{p_i}{|P|} \cdot \frac{q_i}{|Q|} \tag{V-3}$$

For any particular query, the values of q_i/|Q| are fixed and variations
in contributions are due to p_i/|P|. Figure V-9 is a plot of the
contribution ratio, p_i/|P|, for all unique term weights found in typical
profiles from Hierarchy 1. Two versions of each vector are shown, one
having weights based on frequency counts and the other using frequency
ranks. This type of curve is called a correlation contribution curve, and
is used frequently in this study. The curves for frequency counts show
that about 3% of all terms have very large contribution ratios and
control retrieval in the sense that their matches practically guarantee
expansion of the corresponding cluster. This control is due to the fact
that

a) large term weights increase |P| a great deal and result
in large correlation contributions;

b) small term weights change |P| very little and are
relegated to small contributions; and

c) high contributions are obtained at the expense of low
ones since their total is bounded (MAX[COS(P,Q)] = 1).

Without high weight matches, a considerable number of other terms must
match in order to expand a cluster. This, however, is rather uncommon
since queries generally contain only a few index terms (an average of 9
in the Cranfield collection). This power of a few high weight profile
terms to influence correlation and, hence, search outcome is called
correlation dominance. The contribution curves for profiles with weights
based on frequency ranks show much less domination. In these vectors,
the range of weights is much smaller, |P| is smaller, and the correlation
ratios are more evenly distributed. Frequent terms are still more
important than non-frequent terms, but they no longer dominate since it is
easier for a number of other terms to influence cluster expansion. The
reduced correlation domination is the factor leading to the performance
improvement noted in Figures V-7 and V-8. Although frequency ranks are
used here, later experiments show that domination can be reduced by other
methods also. The ranking scheme is simply convenient and maintains
weights which are non-decreasing with frequency.

[Plot: contribution ratio p_i/|P| for the unique term weights of typical
profiles, with one curve for frequency counts and one for frequency ranks]

Cosine Correlation Contribution Ratios, Hierarchy 1

Figure V-9

Another implication of these results is that the importance of an
index term for retrieval does not increase linearly with frequency, but
in a more gradual way. While previous experiments show the validity of
increasing frequency distinctions among terms by shifting to minimal
weight origins, it is clear that extreme distinctions (domination) must
be avoided. Term weights based on frequency ranks benefit because they
meet both criteria. For these reasons, the P2* and P3* profiles provide
the best performance encountered in this research.

Since the P2* and P3* profiles appear throughout this study, it is
appropriate to give a complete example of their construction. Figure V-10
contains such an example and includes a comparison with the standard P2
and P3 vectors. The sample data is the same as that used in Chapter III
for illustrative purposes. As shown in the figure, the starting point is a
standard profile whose term weights are simple frequency counts. First,
all terms are ranked in increasing frequency order; that is, the least
frequent term receives rank 1, etc. This contrasts with Doyle's scheme of
using decreasing order. Note that terms of equal frequencies share the
same rank. Second, the final profile is made by replacing each original
term weight with the term rank established in the previous step. As
discussed earlier, this process reduces the range of weights in the
profile, the vector magnitude, and the range of correlation contributions
from matching terms. This is particularly true for P3* vectors. These
factors decrease the amount of correlation domination and lead to the
performance observed earlier.
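
The two construction steps can be sketched as follows. This is an illustrative sketch rather than the original implementation: the function names are hypothetical, dense ranks are assumed for ties, and the input vectors are the sample P2 and P3 profiles of Figure V-10.

```python
import math

def rank_weighted(freqs):
    """Replace each nonzero frequency by its rank in increasing frequency order."""
    distinct = sorted({f for f in freqs if f > 0})
    rank = {f: i + 1 for i, f in enumerate(distinct)}  # rank 1 = least frequent
    return [rank[f] if f > 0 else 0 for f in freqs]

def contributions(p):
    """Contribution ratios p_i / |P|, rounded to two places."""
    magnitude = math.sqrt(sum(w * w for w in p))
    return [round(w / magnitude, 2) for w in p]

p2 = [1, 0, 2, 0, 0, 5, 3, 0, 0, 2]    # document frequency counts
p3 = [1, 0, 5, 0, 0, 10, 4, 0, 0, 5]   # term frequency counts

for counts in (p2, p3):
    ranks = rank_weighted(counts)
    print(ranks)
    print(contributions(counts))
    print(contributions(ranks))
```

In both cases the largest contribution ratio drops after rank weighting and the smaller ones rise, which is the narrowing of contributions discussed above.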



D. Summary 

This section consists of an investigation of rank value profiles
and their differences from vectors considered previously. The results
include the following points.

a) Low, but not minimal, base values are preferable to large
base values, an optimal choice being difficult to find.

b) The problem of base value selection can be eliminated
by using a fixed, minimal weight origin for each profile.

c) Term weights based on frequency ranks are superior to
weights based on frequency counts.

These properties are present in the P2* and P3* vectors which generally
yield performance superior to either the standard profiles or rank value
profiles. The modified profiles are actually variations of the latter
types. Their only difference from standard vectors is the use of term
weights based on frequency ranks. The change from a rank value profile
is the use of a fixed, minimal weight origin for each vector rather than
a global base value. In any case, all remaining experiments consider the
use and characteristics of P2* or P3* vectors unless explicitly stated
otherwise.

P2* Profile (weights based on document frequency ranks)

Term                   1    2    3    4    5    6    7    8    9   10
Original profile P2    1    0    2    0    0    5    3    0    0    2
Frequency rank         1         2              4    3              2
P2* profile            1    0    2    0    0    4    3    0    0    2

Magnitudes: |P2| = √43, |P2*| = √34

Contributions of term matches to the total cosine correlation

Term                   1    2    3    4    5    6    7    8    9   10
P2/|P2|              .15    0  .30    0    0  .76  .46    0    0  .30
P2*/|P2*|            .17    0  .34    0    0  .69  .51    0    0  .34

P3* Profile (weights based on term frequency ranks)

Term                   1    2    3    4    5    6    7    8    9   10
Original profile P3    1    0    5    0    0   10    4    0    0    5
Frequency rank         1         3              4    2              3
P3* profile            1    0    3    0    0    4    2    0    0    3

Magnitudes: |P3| = √167, |P3*| = √39

Contributions of term matches to the total cosine correlation

Term                   1    2    3    4    5    6    7    8    9   10
P3/|P3|              .08    0  .39    0    0  .77  .31    0    0  .39
P3*/|P3*|            .16    0  .48    0    0  .64  .32    0    0  .48

Construction of P2* and P3* Profiles

Figure V-10

4. Search Bias

A. An Algorithm for Detecting Bias

The cluster search process consists of matching a query with all
profiles on the first level of the hierarchy, ranking them in correlation
order, and selecting several nodes for expansion. The process is repeated
until document vectors are reached. There is a natural curiosity about
the properties of profiles which occupy the initial portions of the
ranking on each level and thereby become expanded. Specifically, it is
desirable to determine whether these profiles have consistent properties
of length, size, magnitude, etc. If a damaging bias is detected, then
steps should be taken to correct it. For example, the discussion of the
experiments on base value selection mentions a bias toward large
clusters in an informal manner. This section formalizes this concept and
develops an analytical procedure for detecting search performance biased
by a particular profile property. Using this technique on various types
of profiles reveals the negative influence of a bias in unweighted and
certain types of rank value profiles.
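
The search process outlined at the start of this section can be sketched as a short recursive routine. Everything in the sketch is an assumption made for illustration: the `Node` structure, the dictionary representation of vectors, the use of a single cutoff on every level, and the tiny two-cluster hierarchy. It is not the SMART implementation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    vector: dict                                   # term -> weight
    children: list = field(default_factory=list)   # empty for a document

def cosine(x, y):
    dot = sum(w * y.get(t, 0) for t, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def cluster_search(query, profiles, cutoff):
    """Rank one level of profiles, expand the top `cutoff`, and recurse."""
    ranked = sorted(profiles, key=lambda n: cosine(query, n.vector), reverse=True)
    docs = []
    for node in ranked[:cutoff]:
        if node.children and node.children[0].children:
            # another profile level lies below this node
            docs.extend(cluster_search(query, node.children, cutoff))
        else:
            # document vectors reached
            docs.extend(node.children)
    return sorted(docs, key=lambda d: cosine(query, d.vector), reverse=True)

d1 = Node("d1", {"a": 1})
d2 = Node("d2", {"a": 1, "b": 1})
d3 = Node("d3", {"c": 1})
level1 = [Node("A", {"a": 2, "b": 1}, [d1, d2]), Node("B", {"c": 1}, [d3])]
print([d.name for d in cluster_search({"a": 1}, level1, cutoff=1)])  # ['d1', 'd2']
```

Which profiles land inside the cutoff on each level is exactly what the bias analysis below examines.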

The analysis procedure starts with data from a given search (so
many relevant and non-relevant clusters ranked above a chosen cutoff on
each hierarchy level) and calculates the expected participation of each
profile in achieving this performance. It then examines the actual
participation of each profile and notes deviations from expected values.
Patterns of large deviations of behavior denote bias. In general, good
recall ceiling and precision floor results are accompanied by little or
no bias. As with cluster-oriented evaluation (RC-PF), the bias analysis
considers each hierarchy level separately; the arguments for and against
this approach are given in Section IV.4.

For a detailed description of the analysis method, consider a
single hierarchy level with K profiles {P_1, P_2, ..., P_K} and a
collection of J requests. Let there be a total of ΣR_i relevant clusters
and ΣN_i non-relevant clusters where

a) R_i is the number of requests for which P_i is relevant and

b) N_i is the number of requests for which P_i is non-relevant.

Consequently, I = ⌈ΣR_i / J⌉ is approximately the average number of
relevant clusters per request so that if the expansion cutoff I is used,
then on the average, all relevant profiles could occupy rank positions
1, 2, ..., I. In actual tests, relevant profiles ranking I or above are
said to perform well while non-relevant profiles ranking I or above are
said to perform poorly. From an actual search then, the following data is
collected:

c) r_i, the number of queries for which P_i is relevant and
ranks I or above and

d) n_i, the number of queries for which P_i is non-relevant
and ranks I or above.

Looking at a specific profile P_i, the ratio r_i/R_i is its relative
frequency of good performance while n_i/N_i is its relative frequency of
poor performance. Over the entire set of profiles, values for these ratios
which differ markedly from expected values may indicate biased results.

Figure V-11a contains sample data for 5 profiles and 20 requests. For
example, there is a total of ΣR_i = 40 occurrences of relevant clusters
distributed as 2, 5, 7, 12, and 14 occurrences among individual clusters.
Therefore a typical request has I = ⌈ΣR_i / J⌉ = 40/20 = 2 relevant
clusters. This data is fixed and unchangeable for the given collections.
In an actual search, suppose Σr_i = 29 relevant profiles rank 1 or 2 with
these occurrences being distributed 1, 3, 5, 9, and 11 among the
individuals. Taking simple ratios r_i/R_i gives the indicated frequencies
of good performance. A similar explanation applies to the non-relevant
items. Because of the way the data is listed, it is easy to observe some
correlation between cluster size and either performance ratio n_i/N_i or
r_i/R_i. Without knowing how much these ratios differ from their expected
values, it is impossible to say whether the search results are biased.
The following analysis examines this situation more closely.



Overall, any particular profile P_i is non-relevant for N_i / Σ_j N_j of
all queries. Furthermore, P_i accounts for n_i / Σ_j n_j of all
non-relevant profiles ranked in the first I positions. There is no a
priori reason for P_i to be more prevalent in these positions than any
other non-relevant profile, so the expected value of n_i/N_i is:

$$E_n = E\!\left[\frac{n_i}{N_i}\right] = \frac{\sum_j n_j}{\sum_j N_j} \tag{V-4}$$

Note the expression is independent of the profile under consideration.
A similar calculation for the ratio r_i/R_i yields

$$E_r = E\!\left[\frac{r_i}{R_i}\right] = \frac{\sum_j r_j}{\sum_j R_j} \tag{V-5}$$

The latter expression represents the true expected value only if all
relevant clusters contain the same ratio of relevant to non-relevant
documents for all queries. In practice, cluster sizes and the number of
relevant documents in them vary a great deal and the above condition does
not hold. Consequently, the conclusions of forthcoming tests are based
primarily on the behavior of non-relevant profiles (n_i/N_i) and supported
by the behavior of relevant profiles (r_i/R_i).

In either case, the important quantities are the deviations of
experimental values from expected values, namely (n_i/N_i) - E_n and
(r_i/R_i) - E_r. Figure V-11b shows these deviations for each of the
profiles in the example. Figure V-11c provides a way of observing search
results biased with respect to cluster size because it lists the average
size of all clusters whose profile behavior falls within various deviation
intervals. The second number in each square is the number of items used
in the average. Thus in the deviation interval [E_n - .1, E_n), two
profiles behave such that E_n - .1 ≤ n_i/N_i < E_n, giving an entry
of 25 (average cluster size) and 2 (number of items). Three factors about
the columns of Figure V-11c indicate the search results in this example
are biased by cluster size:

a) a large range of deviations,

b) a wide distribution of entries throughout this range, and

c) a trend of decreasing cluster size in the column entries.

In this instance bias can be defined by saying that the search procedure,
correlation function, and profile properties are such that larger clusters
are more likely to rank 1, 2, ..., I regardless of their relevancy.

Each of the three considerations listed above is an important factor
in determining the presence of search bias. Figure V-11d is an
illustration of data lacking some of these characteristics. Column A,
having a small range of deviations, is an example of unbiased profile
behavior. Column B shows a wide range of performance deviations, but no
bias because the large range is due to the peculiar performance of one
profile. In other words, there is a narrow distribution of entries within
the range. Finally, column C shows both a large range of performance
ratios and a wide distribution, but lacks a trend in the column of
property values. In this case, bias probably exists, but it is not related
to the property under consideration.

[Tables: sample data for K = 5 profiles and J = 20 queries]

a) Sample Data Related to Biased Searches
b) Deviations of Performance from Expected Values
c) Relating Cluster Size and Performance Deviations
   (amount of deviation against average cluster size and number of
   items in average)
d) Examples of Unbiased Search Results

Figure V-11
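
The ratios and deviations described above can be computed directly from the sample counts quoted in the text for the relevant profiles. The sketch is hedged in two ways: the function name is hypothetical, and taking the expected value as the pooled ratio (the sum of the r_i divided by the sum of the R_i) is an assumption that follows from the argument that no profile should be favored a priori.

```python
def performance_deviations(occurrences, ranked_high):
    """Return the pooled expected ratio and each profile's deviation from it."""
    expected = sum(ranked_high) / sum(occurrences)   # assumed expected value
    ratios = [h / t for h, t in zip(ranked_high, occurrences)]
    return expected, [ratio - expected for ratio in ratios]


R = [2, 5, 7, 12, 14]   # R_i: requests for which P_i is relevant
r = [1, 3, 5, 9, 11]    # r_i: of those, requests where P_i ranks I or above
expected, deviations = performance_deviations(R, r)
print(round(expected, 3))                  # 0.725
print([round(d, 3) for d in deviations])
```

For this sample the deviations rise steadily from the first profile to the last, the kind of trend that the column entries of Figure V-11c are meant to expose.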

B. Investigations of Biased Searches

The above procedure is used to analyze searches made with the
standard, rank value, and modified profiles described earlier. On level 1
of Hierarchy 1, there is an average of I = 4 relevant clusters, so rank
positions 1, 2, 3, and 4 are of primary interest. The corresponding
figure for level 2 is I = 6. In all instances cluster size is the property
investigated, although other properties might show the same results since
large clusters generally have profiles with many terms, large magnitudes,
etc. Cluster size is used because it is invariant among the types of
vectors examined. Figures V-12 to V-15 show the behavioral characteristics
of rank value profiles in Hierarchy 1. The vectors are the same as those
in Section V.3.A, having rank value term weights based on document
frequencies. The following observations can be made:

a) unweighted profiles (infinite base value) show a definite 

bias in favor of small clusters; 

b) decreasing the base value reduces the bias significantly; and 

c) for the lowest base values, there may be a slight bias 
in favor of large clusters. 

Of great importance is the fact that reduced bias with lower base values
is accompanied by the RC-PF performance improvement seen in Figures V-3
and V-4. It is also possible to explain why a base value of 86 is
superior to a base value of 66 in the case of level 1 profiles. Figure
V-12 shows that the largest cluster (size 201) appears in the initial rank
positions as a non-relevant item much more often in the test run with a
base value of 66. Hence the expanded profiles (1, 2, ..., I) frequently
include this cluster, so that all relevant items retrieved must be sifted
out of a large set of other documents. As a result, the values of
precision floor measured are significantly lower even though the recall
ceiling values might be the same in both cases. This is the exact effect
noted in Figure V-3.

To complete the current study of search bias, Figures V-16 to V-19
contain a similar behavioral analysis for the standard profiles (P1, P2,
P3) and their counterparts using rank weighting (P2*, P3*). The two
versions of unweighted profiles, one based on P2 vectors and the other
on P3 vectors, show a large bias in favor of small clusters. In all
other cases, practically no bias is present and it is difficult to draw
further general conclusions. If attention is focused on the behavior
of non-relevant profiles on a particular level and on the range and
distribution of entries in the columns, then it is possible to detect more
or less bias when term weights are based on document frequencies.
However, there is no consensus among levels. It is interesting to note
the peculiar, consistent behavior of one vector (size = 13*0) in Figure
V-16. Unfortunately there is no simple explanation for its behavior such
as its profile covering a large number of dissimilar documents in a
"loose" cluster. A similar phenomenon occurs for a few items on level 2.
Another interesting point is the comparison of behavior of the best rank
value profiles and P2* and P3* vectors. Significantly less bias is noted
with the latter pair and correlates with the PF-RC performance
improvement in Figures V-5 and V-6.

The study of biased behavior has a direct influence on the SMART
system which includes factors such as cluster size and hierarchy level
in determining a query-profile "correlation". These experiments indicate
that it is possible to derive profiles (P2, P2*, P3*) which give
unbiased search performance. Hence, at least for matching items on the
same hierarchy level, there is no need to include any factors related to
cluster size. Given unbiased profiles, it might be possible to determine
the influence of hierarchy level in a similar experiment which
mixes the nodes from adjacent levels. In any case, there is a wide
variety of properties that could be examined using these techniques.

In summary, this section develops an analysis method for identifying
cluster searches which are biased in some manner. The method determines
the expected participation of each profile in establishing the search
result; bias shows up as a pattern of large behavioral deviations which
are related to a specified profile property. Applying this technique to
the vectors used earlier provides the following information concerning
bias due to cluster size.

a) Unweighted profiles and rank value profiles made with a
large base value show a strong bias in favor of small
clusters.

b) Reducing the base value or using a fixed weight origin
reduces bias considerably.

c) Within the same hierarchy level, P2, P3, P2*, and P3*
profiles show unbiased performance and therefore require
no artificial corrections to the cosine matching function
as used in the SMART system.

The tests also substantiate the intuitive notions of bias in the previous
section and help to explain the results observed there. In all cases thus
far, there is a strong correlation between unbiased profiles and good
RC-PF performance.

5. Profile Length 

By any of the previous definitions, a profile is an aggregate of
index terms found in clustered documents. As such, profile vectors become
longer as cluster size increases. Table V-1 shows that in the first test
collection, level 1 profiles average 700-800 terms to characterize 115
documents and level 2 profiles use 200-500 terms for 28 documents. These
figures do not include the initial deletion of terms with unit weights
mentioned in Chapter IV. Obviously, considerable disk space is required
to store the profiles. In fact, on the IBM 2314 disk unit, level 1
vectors occupy 7 tracks, level 2 vectors need 13 tracks, and all 1500
documents use only 59 tracks. In this case storage overhead for the
clustered organization is 34% of the file size and about 35% of this
(all vectors on level 1) must be accessed to begin a search. Therefore,
in order to make a clustered organization usable in a practical sense,
it is imperative to reduce vector lengths in some manner. The following
experiments show that a large number of terms can be deleted without
severely sacrificing search performance.
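
The overhead figures quoted above follow directly from the track counts. The short sketch below simply redoes that arithmetic; the constant and function names are illustrative and do not come from the SMART system.

```python
# Track counts quoted in the text for the first test collection.
LEVEL1_TRACKS = 7       # level 1 profile vectors
LEVEL2_TRACKS = 13      # level 2 profile vectors
DOCUMENT_TRACKS = 59    # all 1500 document vectors


def storage_overhead(profile_tracks, document_tracks):
    """Profile storage expressed as a fraction of the document file size."""
    return profile_tracks / document_tracks


profile_tracks = LEVEL1_TRACKS + LEVEL2_TRACKS
overhead = storage_overhead(profile_tracks, DOCUMENT_TRACKS)

# Share of the profile storage that sits on level 1 and must be read
# before any search can start.
level1_share = LEVEL1_TRACKS / profile_tracks

print(f"overhead = {overhead:.0%}, level 1 share = {level1_share:.0%}")
```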

A particularly simple scheme for reducing vector lengths is to
choose a threshold θ and delete all terms with weights less than this
value, just as in the case of making the "standard" profiles. This
procedure is justifiable since the deleted terms occur only a few times in
their respective clusters and certainly did not cause cluster formation.
In addition, terms with low weights contribute small amounts to
correlations, so that presumably the search results remain relatively
unchanged. Furthermore, since the majority of profile terms have low
weights anyway, the length reduction is great even if a very low threshold
is used. Figure V-20 makes this fact clear by showing the distribution
of term frequencies for the same profiles depicted in Figure V-9. In
combination, the figures show that at least 71% of the profile terms have
individual correlation contributions of less than 0.03 regardless of the
weighting scheme used.

If the above deletion scheme is adopted, a reduced form of profile
P = (p_1, p_2, ..., p_v) can be represented by

$$ P' = P - A $$

$$ A = (a_1, a_2, \ldots, a_v) \quad \text{where} \quad
   a_i = \begin{cases} p_i & \text{if } p_i < \theta \\
                       0   & \text{if } p_i \ge \theta \end{cases}
   \qquad \text{(V-6)} $$

The vector A is called the deletion vector. Substituting its value into
equation V-2 gives an approximation to correlations using the reduced
profile:

$$ \cos(Q, P') \approx \alpha \cos(Q, P) - \beta \cos(Q, A)
   \qquad \text{(V-7)} $$

where α and β are coefficients that depend on |P| and |A|.

When Q = A, the maximum correlation loss |A|/|P| occurs. However, queries
have characteristics quite different from profiles and it is more
meaningful to consider the correlation loss due to specific deleted index
terms. Section V.3.C (equation V-3, in particular) establishes that the
correlation contribution of a term (weight p_i) is proportional to
p_i/|P|. Since θ is the minimum weight retained, the maximum loss per
term is bounded by θ/|P|. (Let θ/|P| be known as the loss ratio, whereas
values of p_i/|P| are contribution ratios.) This suggests several
strategies for obtaining an appropriate cutoff for each profile. For
example, a "constant" strategy might establish a maximum tolerated loss
ratio L and delete terms such that (p_i/|P|) < L. In other cases, the
tolerated loss ratio might depend on the distribution of contribution
ratios (illustrated in Figure V-9). Specifically considered in Table V-2
are cases in which θ is a function of the mean μ and standard deviation σ
of all unique values of p_i/|P| in a given profile.
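
As a rough illustration of the deletion scheme, the following sketch splits a profile into a reduced profile and a deletion vector and checks the worst-case loss |A|/|P| numerically. Everything in it is an assumption made for illustration: profiles are modeled as term-to-weight dictionaries, the sample weights are made up, the function names are hypothetical, and the population standard deviation is used because the text does not say which form was computed.

```python
import math
from statistics import mean, pstdev


def norm(vec):
    """Euclidean length |V| of a sparse term -> weight vector."""
    return math.sqrt(sum(w * w for w in vec.values()))


def cosine(q, p):
    """Cosine correlation between two sparse vectors."""
    dot = sum(w * p.get(term, 0.0) for term, w in q.items())
    denom = norm(q) * norm(p)
    return dot / denom if denom else 0.0


def constant_cutoff(profile, max_loss_ratio):
    """Constant strategy: theta = L * |P| for a tolerated loss ratio L."""
    return max_loss_ratio * norm(profile)


def deviation_cutoff(profile, delta):
    """Deviation-from-mean strategy: theta = (mu + delta * sigma) * |P|,
    with mu and sigma taken over the unique contribution ratios p_i / |P|.
    Assumption: population standard deviation."""
    length = norm(profile)
    ratios = [w / length for w in set(profile.values())]
    return (mean(ratios) + delta * pstdev(ratios)) * length


def split_profile(profile, theta):
    """Return (reduced profile P', deletion vector A) for cutoff theta."""
    reduced = {t: w for t, w in profile.items() if w >= theta}
    deleted = {t: w for t, w in profile.items() if w < theta}
    return reduced, deleted


# Made-up profile whose length |P| is exactly 16.
profile = {"a": 12, "b": 9, "c": 5, "d": 2, "e": 1, "f": 1}
theta = constant_cutoff(profile, max_loss_ratio=0.2)
reduced, deleted = split_profile(profile, theta)

# Worst case Q = A: the query matches only deleted terms, so the whole
# correlation |A| / |P| disappears after deletion.
worst_loss = cosine(deleted, profile) - cosine(deleted, reduced)
print(sorted(reduced), round(worst_loss, 4))
```

A single-term query on a deleted term, such as `{"d": 1}`, loses only its contribution ratio 2/16, which stays below the tolerated loss ratio of 0.2.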

With the test collections at hand, it is difficult to prove the
superiority of one of these strategies. In all cases θ varies among
profiles and in the last two instances, perhaps, its value is more
sensitive to the characteristics of individual vectors. The constant
strategy guarantees that correlations do not change too much for each
deleted term; the others simply assure that a reasonable amount of
"correlation power" is left in the vector. In spite of all this, the
basic question is not one of strategy, but one of showing that a
considerable number of terms can be deleted without damaging search
performance. In this investigation, the third strategy is used and the
parameter δ is changed in the tests on P_3* profiles for both levels of
Hierarchy 1. To review, these profiles have term weights based on term
frequency ranks and provide the best performance thus far. Applying the
term deletion strategy to the profiles considerably reduces their length
as shown in Table V-3. Their corresponding PF-RC performance curves are
shown in Figures V-21 and V-22. The results definitely indicate that a
large portion of low weight profile terms can be deleted with only a
small change in performance. This confirms the earlier assertion that the
initial term deletion to make "standard" profiles does not alter
performance substantially. In some ways, the ability to make extensive
deletions is a pleasing result since it implies that the storage overhead
and search time can be reduced considerably. For example, with a deletion
parameter of δ = -1, level 1 profiles occupy 2 disk tracks and level 2
profiles occupy 4 tracks, making a storage overhead of 7%. Also, using
δ = -1 results in a performance drop of about 1% to 3%, somewhat less
than the improvement found between P_3* and P_3 profiles. In other ways,
it is unsettling that so many terms can be deleted with so little effect.
If it were known that the best possible profiles were being used, then
large deletions would not be bothersome. On the other hand, all of the
profiles considered need a great deal of improvement in order to reach
the best achievable performance (Figures IV-8 and IV-9) and if such
improvements were made, large deletions might yield disastrous search
results.

    Deletion Strategy          Tolerated Loss Ratio    Deletion Cutoff
    1. Constant
    2. Fraction of Mean
    3. Deviation from Mean

    Notes: |P|, μ, σ are obtained for each profile separately;
           L and δ are parameters chosen for profile generation.

Profile Term Deletion Strategies
Table V-2

Search Performance After Deletion of Profile Terms
P_3* Profiles, Hierarchy 1, Level 1
Figure V-21

    Deletion Parameter    Average Length
    δ = -∞                323 (100%)
    δ = -3/2              171 (53%)
    δ = -1                 70 (22%)
    δ = -1/2               37 (11%)

Search Performance After Deletion of Profile Terms
P_3* Profiles, Hierarchy 1, Level 2
Figure V-22

In Litofsky's experiments (2), a profile includes only terms which
are common to all documents in its crown. During clustering, unweighted
vectors (P_1) are used; however, once the hierarchy is made, profiles are
reformatted starting at the lowest level and proceeding upward. In the
reformatting, all profiles with the same parent (filial profiles) have
their common terms removed and retained only in the parent profile.
Applying this procedure to all vectors removes many terms and decreases
the storage overhead for the file. To accommodate the altered structure,
the search strategy is changed so that it follows all paths which match
the query in any manner. This is an appropriate technique since each
profile term applies to all information beneath it.

Litofsky presents no evaluation data related to his profiles, but it
is possible to approximate these vectors and to evaluate them using the
SMART procedures. Specifically, the highest frequency terms in each
SMART profile are eliminated in order to simulate their transfer to parent
profiles. Using the algorithm developed earlier, the deletion cutoff for
each profile is determined from its distribution of term correlation
ratios. A cutoff θ = (μ + γσ)|P| is applied and all terms with weights
larger than θ are removed. This procedure guarantees that the deleted
terms are common to most members of a filial profile set and therefore
are the terms likely to be removed in Litofsky's original scheme.
Obviously the deleted terms are different in each vector. One additional
difference, of course, is the use of weighted profiles in these
experiments. In this regard, the cutoff limits the affected terms to those
with about the same degree of significance in all subordinate information.
This is an important factor in interpreting the evaluation curves in
Figure V-23. In all cases, the test vectors are of the P_3*(δ = -1)
variety; the high weight deletion parameter is set at γ = 1. On level 1,
the additional deletion removes an average of 15 terms, each occurring in
3 of the 13 profiles. On level 2, an average of 6 terms are removed,
each occurring in 1.7 of the 4.2 sons in a filial set.
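
The high-weight deletion used to approximate Litofsky's profiles can be sketched as follows. This is only an illustration under stated assumptions: profiles are term-to-weight dictionaries, the sample weights are invented, the function names are hypothetical, and the population standard deviation is assumed since the text does not specify which form was used.

```python
import math
from statistics import mean, pstdev


def high_weight_cutoff(profile, gamma):
    """theta = (mu + gamma * sigma) * |P|, with mu and sigma taken over
    the unique contribution ratios p_i / |P| of the profile."""
    length = math.sqrt(sum(w * w for w in profile.values()))
    ratios = [w / length for w in set(profile.values())]
    return (mean(ratios) + gamma * pstdev(ratios)) * length


def drop_high_weight_terms(profile, gamma):
    """Remove every term whose weight exceeds the cutoff theta.
    gamma = infinity is treated as 'no high-weight deletion'."""
    if math.isinf(gamma):
        return dict(profile)
    theta = high_weight_cutoff(profile, gamma)
    return {t: w for t, w in profile.items() if w <= theta}


# Made-up profile; only the heaviest term lies above mu + sigma.
profile = {"a": 12, "b": 9, "c": 5, "d": 2, "e": 1, "f": 1}
print(sorted(drop_high_weight_terms(profile, gamma=1.0)))
```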

The large performance loss observed in Figure V-23 leads to the
following observations. First, major profile terms cannot be removed
from some parts of the hierarchy and retained in others, especially within
the same filial set. All parts of the hierarchy must be treated alike,
as shown in the following example. Consider two profiles P and P', of
which the first contains term T and the second has T elevated to its
parent node. Now consider a query containing T which passes through the
upper hierarchy via some search path and which is matched with both
profiles. The query matches T in P, but not in P', even though the term is
probably more characteristic of the latter cluster. In the case of P',
the match is recorded on some previous level and this information might
be carried along in the search. However, this places a great emphasis
on the problem of relating the importance of T in the parent to its
importance in individual subordinate profiles. This leads to the second
observation, namely that individual weights of major profile terms are
important in differentiating profiles from all other vectors, filial or
non-filial. In practice most profiles contain a great many common terms
(e.g., 70%) and their weight distribution is the primary differing
characteristic. Under Litofsky's scheme, common term occurrences are
coalesced in the parent in a way that obscures the important weight
differences in subordinate profiles. The easiest and probably best solution
to this problem is to avoid altering the profile in the beginning.
Whereas Litofsky's scheme may be viable for simple, unweighted profiles, it
is definitely to be avoided in more sophisticated systems.

    Level    Deletion Parameters    Average Length
    1        δ = -1, γ = ∞          141 (100%)
    1        δ = -1, γ = 1          126 (89%)
    2        δ = -1, γ = ∞           70 (100%)
    2        δ = -1, γ = 1           64 (91%)

Performance Loss Due to Deletion of High Weighted
Profile Terms, Hierarchy 1, P_3* Profiles
Figure V-23

To summarize, profiles can be subjected to considerable deletion of
low weight (less frequent) terms with little change in the quality of
search output. A number of procedures are suggested for deriving a
deletion cutoff which guarantees that correlations remain reasonably
undisturbed. Experiments with one method (θ = (μ - σ)·|P|) indicate that
the deletion of 80% of the lowest weight terms drops the RC-PF measures
only 1% to 3%. On the other hand, an attempt to remove or combine related
occurrences of high weight profile terms results in much poorer
performance. Such procedures are to be avoided.

6. Frequency Considerations 

Up to the present, finding an adequate profile has been handled as a
problem of indexing a "super document" composed of the clustered documents.
Using this analogy, the P_1, P_2, and P_3 profiles are extensions of
conventional indexing techniques. This research suggests modifying the
standard definitions a bit in order to achieve better results. In all
cases, however, the importance of a term to retrieval (that is, the amount
its match contributes to the total cosine correlation) is non-decreasing
with frequency. Many retrieval experts dislike using a monotonic
relationship between frequency and importance. Consequently, this section
considers profiles in which term importance (correlation contribution) is
not monotonic, but first increases and then decreases with total frequency.
Contrary to expectations, performance decreases under these conditions and
the monotonic relationship is established as a better approximation to the
true association.

As background, consider the work of H. P. Luhn selecting terms for
an indexing vocabulary for a set of documents. Hypothesizing a
relationship between frequency and retrieval importance such as that in
Figure V-24, Luhn argues that words with high or low frequency have little
significance and therefore can be eliminated (3). Recent experiments by
Bonwit and Aste-Tonsman measure the relationship between retrieval results
and the statistical properties of a collection's vocabulary (4). Among
other things, they concur with Luhn that generally the most discriminating
terms have mid-range frequencies.

Luhn's Hypothetical Relationship Between
Retrieval Significance and Term Frequency
Figure V-24

If the mid-frequency terms are the best discriminators in the entire
collection vocabulary, and therefore are the most important keywords, then
a logical progression is to extend this concept to the mini-vocabularies
of individual profiles. That is, given the profile index terms and their
frequencies, the largest weights should be assigned to terms with the
middle frequencies. Consequently, the correlation contribution ratio
p_i/|P| from a matching profile term increases and then decreases as the
term involved has greater frequency. The distribution of contribution
ratios for some of the profiles used in this section are shown in Figures
V-25 and V-26. The method for producing the increasing-decreasing
behavior is given below; for now it is sufficient just to note the general
shape of the curves.

So far, this research has not contradicted Luhn's original concept
as extended to individual clusters or the mini-vocabularies in their
profiles. Clearly, many low frequency terms can be eliminated without much
effect on retrieval. Deletion of high frequency profile terms was tried
in the previous section and produced very poor results. However, a
number of very common words (a, the, as, ...) are removed prior to
experimentation and these are probably the terms Luhn would eliminate on
the high end of the frequency spectrum.

The remaining task is to evaluate the effectiveness of profiles 
whose mid-frequency terms provide the largest correlation contributions. 
Contribution Ratios Resulting from Bending
Hierarchy 1, Level 1, Node 5, P_3*(δ = -1) Profile, 155 Terms
Figure V-25

Contribution Ratios Resulting from Bending
Hierarchy 1, Level 2, Node 14, P_3*(δ = -1) Profile, 62 Terms
Figure V-26

This condition is approximated in the following experiments by
reweighting terms to "bend" the normal monotone contribution curves into
an increasing, then decreasing shape. The input vectors are of the
P_3*(δ = -1) variety so weights are based on frequency ranks and some
term deletion has occurred. Briefly, a maximum allowable weight θ is
established for each profile, and any larger term weight θ + x (x > 0)
has its value lowered to θ - x. A specific description of the re-weighting
algorithm for an input profile P = (p_1, ..., p_v) is as follows:

a) calculate the mean μ and standard deviation σ of all
unique values of contribution ratios p_i/|P|;

b) determine the maximum allowable weight (or bend point)
as θ = (μ + ξσ)|P| where ξ is a constant parameter
chosen in advance;

c) re-assign term weights according to

$$ p_i' = \begin{cases} p_i & \text{if } p_i \le \theta \\
                        2\theta - p_i & \text{if } p_i > \theta \end{cases} $$

Figures V-25 and V-26 show the original and altered contribution curves
for typical profiles (ξ = ∞, 1, 0.5, 0). In all cases, the altered
profiles assign the most importance (largest contribution ratios) to terms
of middle frequencies. The parameter ξ controls the number of affected
terms.
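
Steps a) through c) can be sketched in a few lines. The code below is an illustration only: profiles are modeled as term-to-weight dictionaries, the sample weights are invented, the function names are hypothetical, and the population standard deviation is assumed. The text does not say how a weight larger than twice the bend point is handled, so this sketch leaves the reflected value as computed.

```python
import math
from statistics import mean, pstdev


def bend_point(profile, xi):
    """Steps a) and b): theta = (mu + xi * sigma) * |P|, with mu and sigma
    taken over the unique contribution ratios p_i / |P|."""
    length = math.sqrt(sum(w * w for w in profile.values()))
    ratios = [w / length for w in set(profile.values())]
    return (mean(ratios) + xi * pstdev(ratios)) * length


def bend_profile(profile, xi):
    """Step c): a weight theta + x above the bend point becomes theta - x.
    xi = infinity is treated as 'no bending'."""
    if math.isinf(xi):
        return dict(profile)
    theta = bend_point(profile, xi)
    return {t: (w if w <= theta else 2 * theta - w)
            for t, w in profile.items()}


# Made-up profile: after bending, the second-heaviest term carries the
# largest weight, giving an increasing-then-decreasing contribution curve.
profile = {"a": 12, "b": 9, "c": 5, "d": 2, "e": 1, "f": 1}
bent = bend_profile(profile, xi=1.0)
print(max(bent, key=bent.get))
```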

Evaluation curves for the altered profiles (Figures V-27 and V-28)
suggest that performance drops steadily as the contribution curve
receives greater bending (ξ and θ decrease). Thus, it appears that
term weights within individual profiles should not decrease with
frequency, but should increase or remain constant. It must be said that
part of the performance loss is due to the fact that the characteristics
of each profile determine its degree of modification. Thus, a term may
have its weight substantially altered in one case and remain unchanged in
another. On one hand, earlier tests advise against different treatment
of individual profiles. On the other hand, strict adherence to this rule
neglects natural variations in term importance due to different cluster
contents. So, even though Luhn and Bonwit and Aste-Tonsman indicate that
the mid-frequency terms are the best discriminators within a collection
vocabulary, this concept does not carry over to the indexing of individual
vectors. In the latter situation, high frequency terms have at least as
much significance as mid-frequency terms. At the very least, term weights
within individual profiles should be non-decreasing with frequency.

Search Performance Resulting from Profiles with Increasing-
Decreasing Contribution Curves, Hierarchy 1, Level 1, P_3*(δ = -1) Profiles
Figure V-27

Search Performance Resulting from Profiles with Increasing-
Decreasing Contribution Curves, Hierarchy 1, Level 2, P_3*(δ = -1) Profiles
Figure V-28

7. Unweighted and Partially Weighted Profiles

From the start, unweighted profiles (P_1) demonstrated poor search
performance and results which were biased in favor of small clusters. In
spite of this, unweighted profiles merit additional attention because of
their simplicity and storage economy. The storage considerations are not
minor, in the SMART system at least, where weights require as much memory
as index term identifiers (a S/360 halfword each). This section describes
a number of attempts to correct the performance deficiencies of unweighted
profiles while retaining their other advantages.

To review, Section V.4 develops a technique for detecting search
results which are biased with respect to a specific profile property.
Tests made on unweighted profiles show such searches favor the retrieval
of small clusters regardless of their relevancy. A straightforward scheme
to remove bias is to alter the cosine matching function to give larger
correlations when large clusters are involved. Specifically, given a
profile P of size S (number of documents in its crown) and a query Q, a
modified cosine value is computed from the equation

$$ \mathrm{MCOS}(Q,P) = S^{\tau}\,[\mathrm{COS}(Q,P)] \qquad \text{(V-8)} $$

where τ is an experimentally determined constant. Figure V-29 shows the
effect of using the MCOS function on unweighted profiles for Hierarchy 1,
level 1; the P_3* curve is included for comparison. Figure V-30 shows the
data related to biased behavior in the searches. Combining this
information yields the following observations. First, the modified cosine
does improve performance slightly as τ increases and at the same time
reduces bias toward small clusters. It is doubtful, however, that much
additional improvement can be made by increasing τ further. Second, the
behavior of P_1 profiles with τ = .2 is an example of unbiased, but poor
search results. Earlier tests show that good performance is free of bias;
obviously the converse does not hold. Third, even though Figure V-30
shows no strong size bias for τ > 0, bias may exist with respect to
another profile property. This is a likely situation since the range of
deviations is large and has a broad distribution of entries.
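
A minimal sketch of the size-dependent matching function in equation V-8 is given below. The dictionary representation, the function names, and the sample cluster sizes are assumptions made for illustration; only the form S^τ · COS(Q,P) comes from the text.

```python
import math


def cosine(q, p):
    """Cosine correlation between two sparse term -> weight vectors."""
    dot = sum(w * p.get(term, 0.0) for term, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    np_ = math.sqrt(sum(w * w for w in p.values()))
    return dot / (nq * np_) if nq and np_ else 0.0


def modified_cosine(q, p, crown_size, tau):
    """MCOS(Q, P) = S**tau * COS(Q, P); tau = 0 gives the plain cosine.
    The result is a ranking score and may exceed 1."""
    return (crown_size ** tau) * cosine(q, p)


# Hypothetical unweighted profiles: a short one for a small cluster and a
# long one for a large cluster, both matching the same two query terms.
small = {f"t{i}": 1 for i in range(4)}     # crown of 10 documents
large = {f"t{i}": 1 for i in range(16)}    # crown of 160 documents
query = {"t0": 1, "t1": 1}

plain_ratio = cosine(query, small) / cosine(query, large)
mcos_ratio = (modified_cosine(query, small, 10, 0.2)
              / modified_cosine(query, large, 160, 0.2))
print(round(plain_ratio, 3), round(mcos_ratio, 3))
```

In this made-up case the advantage of the small cluster shrinks under MCOS but does not vanish, which is in line with the modest improvement reported above.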

Additional investigation of bias in unweighted profiles led to the
following insight and experiment. Because the documents are grouped, some
terms in each profile are more characteristic of a cluster than others.
In fact, this work has indicated that a large number of terms are
completely unnecessary. However, an unweighted vector assigns equal
significance to all terms; in a vector of length k all terms have a
correlation contribution of 1/√k, an amount which is considerably smaller
for longer profiles (large clusters) than shorter profiles (small
clusters). Thus for the same number of matching profile and query terms,
the shorter profile obtains a larger correlation. Furthermore, prior to
deletion, all profiles contain about the same set of terms so it is quite
probable that several query-profile correlations result in the same number
of matching terms. In the case of P_2 or P_3 vectors, these terms are
differentiated by their weights; obviously this is not the case for
P_1 vectors. Consequently, in usual circumstances, small clusters (hence,
short profiles) represented by unweighted profiles usually receive high
correlations and are expanded, regardless of their relevancy. If the
above conjectures are true, then selective term deletion should improve
performance by reducing the occurrence of query-profile matches which
involve the same number of terms and by making the values of 1/√k more
uniform throughout the profile collection.

    Legend: P_1 unweighted, τ = .2; P_3* weighted

Performance of Unweighted Profiles Using a Size Dependent
Cosine Function
Hierarchy 1, Level 1
Figure V-29

Behavioral Characteristics of Non-Relevant Profiles
Unweighted Vectors, Size Dependent Cosine Function
Hierarchy 1, Level 1
Figure V-30
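
The 1/√k argument is easy to check numerically. In the sketch below, unweighted vectors are modeled as plain sets of term identifiers; the function name and the sample terms are hypothetical.

```python
import math


def unweighted_cosine(query_terms, profile_terms):
    """Cosine between two unweighted (binary) vectors given as term sets:
    matches / (sqrt(|Q|) * sqrt(|P|))."""
    if not query_terms or not profile_terms:
        return 0.0
    matches = len(query_terms & profile_terms)
    return matches / math.sqrt(len(query_terms) * len(profile_terms))


query = {"t0", "t1"}
short_profile = {f"t{i}" for i in range(4)}    # k = 4, small cluster
long_profile = {f"t{i}" for i in range(16)}    # k = 16, large cluster

# Both profiles match the same two query terms, yet the shorter profile
# receives twice the correlation because 1/sqrt(4) is twice 1/sqrt(16).
print(unweighted_cosine(query, short_profile),
      unweighted_cosine(query, long_profile))
```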

In order to test this idea, unweighted profiles are made from
weighted profiles after deleting unimportant, low frequency terms as
described earlier. Specifically, P_3*(δ = -1) vectors have the weights
of all remaining terms set to a constant value. These unweighted profiles
are denoted by P_1*(δ = -1). Since each vector contains only those
terms which are most characteristic of the corresponding cluster, it is
much less likely that correlations involve the same number of matching
terms. Figures V-31 and V-32 compare the performance of unweighted
profiles with and without such term deletion. Figures V-33 and V-34 contain
the data related to their bias behavior. On both levels, a performance
improvement and unbiased behavior are noted so the previous conjectures
appear valid. In fact, the performance difference between the shortened
weighted and unweighted vectors (curves for P_1*(δ = -1) and P_3*(δ = -1))
is much smaller than expected. This suggests, perhaps, that fine frequency
distinctions among important index terms are much less valuable than
selection of good index terms themselves. The selection procedure used here
(deletion) is crude and does not attempt to obtain good index terms, but
tries to eliminate bad ones. However, if an independent procedure can be
found which selects only pertinent terms, then it may be that complex
weighting schemes are completely unnecessary.

From the previous experiment one concludes that once noise terms are
removed from a document or profile, fine term distinctions based on
frequency are not particularly useful. In fact, a few broad weight
categories might provide as good a performance as a complete weight range;
hence the notion of partial weighting as opposed to the previous full
range weighting. In order to test partial weighting, the terms in
P_3*(δ = -1) vectors are placed in one of four weight classes by the
following procedure. Given a profile, new values of μ and σ are computed,
and the smallest (MIN) and largest (MAX) remaining weights are determined.
Bounds on the weight classes are then derived from these quantities.
Finally each term is assigned a new weight equal to the midpoint of the
class indicated by its original weight. Figure V-35 compares the
performance of P_3*(δ = -1) profiles with full, partial, and no weighting.
The results obviously substantiate the usefulness of partial weighting, at
least for this hierarchy. This is not surprising considering that

a) the use of weight classes eliminates correlation 
domination from very high frequency terms and 

b) deletion of low weight terms removes many terms that 
do not play a part in causing cluster formation. 

As expected, unweighted vectors provide somewhat poorer performance in
spite of the improvements caused by term deletion. More surprising is
the fact that only 4 weight classes appear nearly equivalent to a full
range of weights. A 2-class scheme (LOW and HIGH) could be expected to
provide performance slightly inferior to 4 classes. Two classes would
be the easiest to implement in the SMART system since the sign bit of
term identifiers (concepts) could denote the weight. An interesting way
of producing weight classes in a document or profile might be to assign the
weight w to a term of frequency f according to the formula:

$$ w = \max\bigl(0,\ \lceil \log_x f \rceil - y\bigr) \qquad \text{(V-9)} $$

where x and y are constants. The logarithm function smooths out the
weight range (similar to the use of ranks); the ceiling function produces
a number of weight classes (1, 2, ...); and y acts as a deletion cutoff if
only terms with positive weights are retained. This scheme appears to
produce a vector with all the desirable properties mentioned so far and
uses an extremely simple mechanism. Undoubtedly, there are many such
schemes.
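
The weight-class formula can be sketched as below, assuming equation V-9 reads w = max(0, ⌈log_x f⌉ - y) with x taken as the logarithm base; the printed equation is partly illegible, so this reading is an assumption. The function names and default parameter values are hypothetical, and x is restricted to an integer of at least 2 so that the ceiling of the logarithm can be computed exactly with integer arithmetic.

```python
def ceil_log(frequency, base):
    """Smallest integer k with base**k >= frequency.
    Assumes integer frequency >= 1 and integer base >= 2."""
    k, power = 0, 1
    while power < frequency:
        power *= base
        k += 1
    return k


def class_weight(frequency, x=2, y=1):
    """Assumed reading of V-9: w = max(0, ceil(log_x f) - y).
    Terms that receive weight 0 would be deleted from the vector."""
    return max(0, ceil_log(frequency, x) - y)


# With x = 2 and y = 1, frequencies 1-2 are deleted, 3-4 fall in class 1,
# 5-8 in class 2, 9-16 in class 3, and so on.
for f in (1, 2, 3, 4, 5, 8, 9, 16, 17):
    print(f, class_weight(f))
```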
    Curves (levels 1 and 2): full weighting; partial weighting
    (4 classes); unweighted

Performance of Profiles with Full, Partial, and No Weights
Hierarchy 1, P_3*(δ = -1) Profiles
Figure V-35

partially weighted profiles. The results indicate that

a) term deletion in unweighted profiles causes significant
performance improvement and reduces their bias with
respect to cluster size and

b) profiles with partial weighting (4 classes) can achieve
performance equivalent to full range weighting along
with some storage economy.

Whether or not a system can take advantage of the efficiency in partial
weighting may depend on implementation factors such as machine word size
and the number of bits allocated to term identifiers. Otherwise the
choice lies at the extremes of no weighting or full range weighting. In
either case, the fact that a small number of weight classes (1, 2, ...)
works so well points out that fine distinctions among term frequencies
are not needed for document retrieval.

8. Summary of Results for Hierarchy 1

The descriptions, methods, test procedures, and results presented
in the preceding sections all deal with Hierarchy 1 and cluster-oriented
evaluation. These are used because the hierarchy contains the number and
size of nodes considered typical and because the evaluation is independent
of a number of search parameters. The large number of options to be
tested makes it impossible to examine them using all three hierarchies
and both evaluation methods available. At this time, it is appropriate
to summarize the findings for Hierarchy 1 and to select several options
for more complete testing in the other clustered collections. Below, the
findings are listed by section.

A. Standard Profiles

1) Weighted profiles perform significantly better than
unweighted profiles, P_2 > P_1 and P_3 > P_1.

2) Term weights based on document frequency appear
equivalent to weights based on total term occurrence,
P_3 ≈ P_2.

3) A slight performance advantage is observed for unweighted
profiles made from the shorter P_2 vectors as opposed
to those made from the longer P_3 vectors.

B. Rank Value Profiles

1) Base values should be kept small to maintain distinctions
among terms; however, too small a value biases search
results.

2) Use of a minimal weight origin, constant for all
profiles, enhances performance, eliminates bias, and avoids
the problem of base value selection.

3) Weights based on frequency ranks avoid correlation
domination and give better performance than weights
based on frequency counts, P_3* > P_3 and P_2* > P_2.

C. Biased Search Results

1) Unweighted profiles and rank value profiles using a
large base value show a definite bias in favor of small
clusters. Reducing the base value decreases the bias
and, to some extent, is accompanied by a performance
improvement.

2) P_2, P_3, P_2*, and P_3* vectors show very little bias in
their search performance within the same hierarchy level.

3) PF-RC performance improvement has a high correlation
with a reduction of bias; however, reducing bias does
not necessarily produce an automatic performance
improvement.

4) There is no need to include cluster size as a factor
in determining query-profile correlations within the
same hierarchy level.

D. Profile Length

1) A large number of low frequency terms can be deleted
without greatly reducing search performance,
P_3* ≈ P_3*(δ = -1).

2) Major profile terms cannot be selectively deleted nor
transferred to a parent profile,
P_3*(δ = -1) > P_3*(δ = -1, γ = 1).

E. Frequency Considerations

Term weights within individual profiles should be
non-decreasing with frequency.

F. Unweighted and Partially Weighted Profiles

1) Term deletion improves performance and eliminates bias
when using unweighted profiles, P_1*(δ = -1) > P_1.

2) A limited number of weight classes give performance
which is equivalent to using a full weight range.

3) Fine term distinctions based on frequency are of limited
appropriateness in gauging retrieval significance.
It is easy to see that a number of findings are related. For example,
vector length affects storage considerations and the performance of
unweighted and partially weighted profiles. As another example, frequency
ranks and weight classes are both techniques for reducing correlation
domination. As a third, the choice of base value (or weight origin) or
the use of unweighted vectors strongly affects the amount of bias in
search results.

The profile types selected for use in the confirmation tests on
Hierarchies 2 and 3 are P_3, P_3*, P_3*(δ = -1), and P_1*(δ = -1). The
P_3 vector is a standard profile serving as a basis of comparison. The
P_3* profile showed the best performance of any in Hierarchy 1. The
remaining vectors are more economical versions of the P_3*. All contain
those qualities found to be most beneficial thus far: unbiased behavior,
lack of correlation domination, and storage economy.



9. Confirmation Tests

To verify the previous results, a subset of the experiments is
repeated on Hierarchies 2 and 3. If such confirmation tests yield the
same general results, then the conclusions drawn from the earlier
experiments are greatly strengthened. As mentioned earlier, four types of
profiles are used in the confirmation tests, thus making it possible to
investigate:

a) the superiority of relating profile term weights to
frequency ranks rather than frequency counts (P_3* versus
P_3);

b) the ability to delete a large number of low weight
profile terms without a large performance loss
(P_3*(δ = -1) versus P_3*); and

c) the relative performance of shortened profiles with and
without term weights (P_1*(δ = -1) versus P_3*(δ = -1)).

The results of these tests determine, to a large extent, the best profile
for searching a clustered hierarchy.

Since Hierarchies 2 and 3 have received little attention, it is
appropriate to review their properties as described in Section IV.3 and
summarized in Table V-4. Hierarchy 1, used exclusively so far, has low
overlap (7%) and document clusters which approximately fill one disk
track (28 documents). Hierarchy 2 has high overlap (91%) and document
clusters of about the same size. Because of the high overlap and the
fact that there are only a few nodes on level 1, it is possible to make
only very broad distinctions among them. Hierarchy 3 has no overlap and
averages 14 documents per cluster. Since there are a moderately large
number of nodes per level, the search algorithm should have less
difficulty distinguishing relevant profiles than in the other hierarchies.
In all cases the shortened profiles (δ = -1) have about 20% of the
length of their original vectors.
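
The overlap figures quoted above can be checked against the definition that accompanies Table V-4 (total nodes on level 3 divided by collection size, less one). The following is a minimal sketch for that arithmetic only; the function name `overlap` is illustrative and is not part of SMART.

```python
def overlap(level3_nodes: int, collection_size: int) -> float:
    """Overlap per the Table V-4 footnote: level-3 nodes / collection size, less one."""
    return level3_nodes / collection_size - 1.0


# Level-3 node counts listed in Table V-4 for Hierarchies 1, 2, and 3,
# all built on the 1400-document Cranfield collection.
for hierarchy, nodes in [(1, 1500), (2, 2679), (3, 1400)]:
    print(f"Hierarchy {hierarchy}: overlap = {overlap(nodes, 1400):.0%}")
# prints 7%, 91%, and 0% respectively
```

A value of zero means every document appears in exactly one cluster; larger values mean documents are duplicated across clusters.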

The confirmation experiments employ both cluster-oriented evaluation
(RC-PF data gathered from both levels) and SMART evaluation (P-R data
from narrow and broad searches) as described in Section IV.4. Consequently,
each of the 4 profile types is involved in 4 searches in 3 hierarchies,
making a total of 48 performance curves. Because of the many variables
involved and the large number of curves, the actual plots are included as
a special section (Appendix C). A result summary is contained in Table
V-5, which shows the relative merit of each profile in each test case.

                                              Hierarchy
Property                                 1         2         3

Level 1 (Profiles)
  Number of nodes                       13         6        28
  Average crown                        115       446        50
  Average sons                           4        16         4
  Average profile length,
    P3 or P*                           812       908       526
  Average profile length,              141       207       103
    weighted or unweighted           (17%)     (23%)     (20%)
    P*(δ = -1)

Level 2 (Profiles)
  Number of nodes                       55        94       103
  Average crown                         27        28        14
  Average profile length,
    P3 or P*                           323       311       197
  Average profile length,
    weighted or unweighted           (22%)     (22%)     (24%)
    P*(δ = -1)

Level 3 (Documents)
  Number of nodes                     1500      2679      1400
  Overlap*                              7%       91%        0%

*Overlap: ratio of total nodes on level 3 to collection size (1400),
less one.

For example, in Hierarchy 3 and for a broad SMART search, the P* profiles
perform best, the P3 and weighted P*(δ = -1) vectors give equivalent
performance and share an average rank of 2½, and the unweighted
P*(δ = -1) profiles perform poorly. The individual results for each case
do not differ greatly from the overall results, thus indicating that the
findings are stable and not coincidental. Consequently, it is likely that
the effects observed here occur in most other document collections besides
the Cranfield. The relative merit of each profile is about the same
throughout all hierarchies, so the following summary conclusions can be
made:

a) Weights of profile terms should be based on frequency
ranks (P* > P3). The use of ranks is an effective way
of reducing correlation domination and bias in search
results.

b) A large number of low weight profile terms can be deleted
without a large performance loss (P*(δ = -1)).
For the chosen deletion parameter, the performance loss
from using about 20% of the original terms is about the
same as the performance gain made by switching term weights
to ranks. Whereas this amount of deletion is probably
not optimal, it does indicate that a large length
reduction does not lead to disastrous search results.

c) Shortened unweighted profiles performed poorly in every
case and should not be used (weighted P*(δ = -1) >
unweighted P*(δ = -1)). However, the previous sections
suggest that complex weighting schemes may not be necessary.

*Entries denote merit in terms of rank: first, second, etc.
Ties are given the average rank.

Relative Merit of Selected Profiles in Confirmation Tests

Table V-5

In general, the findings for Hierarchy 1 are substantially confirmed by
these tests, a possible exception being the case of unweighted vectors.
In the initial collection the unweighted P*(δ = -1) profiles gave
promising performance (still low), which did not show itself in the other
hierarchies.
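
Table V-5 reports merit as ranks, with tied profiles sharing the average of the rank positions they occupy. The sketch below illustrates only that tie convention. The function name and the scores in the usage example are invented for illustration and are not results from these experiments.

```python
def average_ranks(scores: dict[str, float]) -> dict[str, float]:
    """Rank items by descending score; tied items share the average rank."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks: dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of items tied with ordered[i].
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2  # mean of 1-based positions i+1 .. j+1
        for name, _ in ordered[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks


# Hypothetical scores: two profiles tie for second and third place.
example = average_ranks({"best": 0.9, "tie_a": 0.7, "tie_b": 0.7, "worst": 0.4})
# example == {"best": 1.0, "tie_a": 2.5, "tie_b": 2.5, "worst": 4.0}
```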

What is missing from this discussion, naturally, is the description
of just how profiles are ranked on the basis of relative merit and what
constitutes a significant difference in performance. A detailed discussion
of these problems is reserved for Appendix C, but a summary is as
follows. Generally a difference in measures (PF-RC, P-R, NR, NP,
etc.) is considered significant. This is about half the amount used in
earlier SMART experiments. However, since 4 times as many requests are
involved here, conclusions have about the same level of confidence. In
cluster-oriented evaluation, the PF-RC curves are compared on a
point-to-point basis and judgment rendered on the merit of each profile
type. Since the curves are generally parallel, this technique poses no
problems. SMART's precision-recall plots and accompanying normalized
measures present some difficulty because the number of document and
profile correlations differs among searches. At times it is necessary to
determine whether performance should be traded for cost (fewer
comparisons). The following criteria are used in these circumstances:

a) the most desirable profile is that giving superior search
performance for the least effort;

b) for the same number of correlations on each level, merit
is determined directly from P-R values and the normalized
measures;

c) profile correlations weigh much more heavily than document
correlations in determining "equal effort"; and

d) a 2% difference in normalized measures or a ^ difference
in P-R curves is considered significant and is offset
only by substantially less search effort (one or fewer
profile correlations).

Using this procedure, the relative merit of each profile type is obtained
for each hierarchy and entered in Table V-5. This data, then, leads
to the summary conclusions stated earlier in this section.

10. Discussion

This chapter presents a long series of experiments related to profile
construction. In the process, new analysis and evaluation methods are
developed which have applications to other studies as well. The results
of the initial experiments are adequately summarized in Section 8; Section
9 summarizes the confirmation tests. Here, it is sufficient to say that
techniques have been developed for constructing an adequate and economical
profile for a clustered document collection.

This does not imply that there is no room for improvement. If the
best precision-recall curves are compared with a similar curve for a full
search (Figure IV-5), cluster searching appears to give very poor
performance. However, it must be said that cluster searching is not
designed for high recall work, but for flexibility and cost-performance
tradeoff. Still, a comparison of the best PF-RC curves with the best
achievable results for the same collections (Figures IV-7 and IV-8) shows
that profiles, search strategies, and correlation functions could be
greatly improved. Consequently, under the current situation, search
strategies should be designed on a minimum exclusion principle. That is,
only nodes with very low correlations should be excluded from
consideration while the rest are expanded. This contrasts with the current
minimum inclusion philosophy, which expands as few nodes as possible to
provide a user with his requested number of documents. The former
procedure results in greater cost, but more effective retrieval. Hopefully
on larger collections (100,000 items), 90% of the documents could be
easily excluded while the rest require detailed examination.
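
The contrast between the two philosophies can be sketched as two node-selection rules. This is an illustration under stated assumptions, not the SMART search algorithm: the function names, the exclusion threshold, and the correlation values are hypothetical, and minimum inclusion is simplified to expanding the best-correlated nodes until their crowns could supply the requested number of documents.

```python
def minimum_inclusion(nodes, wanted):
    """Expand as few nodes as possible: take nodes in order of decreasing
    correlation until their crowns could supply `wanted` documents.
    (With overlapping clusters the crown total is only an upper bound.)"""
    chosen, covered = [], 0
    for name, corr, crown in sorted(nodes, key=lambda n: n[1], reverse=True):
        if covered >= wanted:
            break
        chosen.append(name)
        covered += crown
    return chosen


def minimum_exclusion(nodes, threshold):
    """Exclude only nodes whose correlation is very low; expand all others."""
    return [name for name, corr, crown in nodes if corr >= threshold]


# Hypothetical (node name, query-profile correlation, crown size) triples.
nodes = [("n1", 0.42, 30), ("n2", 0.31, 25), ("n3", 0.05, 28), ("n4", 0.28, 27)]

print(minimum_inclusion(nodes, wanted=30))      # ['n1']
print(minimum_exclusion(nodes, threshold=0.10))  # ['n1', 'n2', 'n4']
```

In this toy case minimum exclusion expands three nodes instead of one, which mirrors the greater cost and broader coverage described above.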
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Chapter VI * 



File Maintenance Experiments

1. Introduction

File maintenance (or updating) is the process by which new documents
are added to the data base, including whatever re-organization is required
to maintain standards for search time, storage economy, and quality of
retrieved output. A good updating procedure is especially important in
a clustered collection in order to prolong the useful life of the document
classification. However, Chapter III points out that no hierarchy can
stand unlimited growth without changes to its profiles and structure. One
part of file updating, then, is the alterations to be made to individual
profiles. Presumably, altering a profile shifts it to a position within
the cluster which more adequately represents the combination of new and
old documents. These experiments examine five alteration procedures for
each profile on a document update path:

                                  Original Profile Type
Maintenance Procedure        Weighted,        Unweighted,
                             P*(δ = -1)       P*(δ = -1)

Construct completely             X                X
new profiles

Alter weights of only            X
existing profile terms

Use existing profile             X                X
(add document to
crown only)

The purpose of the first study in this chapter is to determine which
profile maintenance schemes are most effective.

At some point, file maintenance requires changes to be made in the
hierarchy structure because new documents alter the character of the
collection and ultimately cause the original classification to lose its
value. For example, clusters may become too large or polarized and cannot
be represented adequately by a single profile. Or many additions in one
part of the collection might indicate that a more logical classification
would split various clusters and combine their parts in a different way.
Without re-structuring (re-clustering), the hierarchy degenerates quite
apart from any changes made to profiles. The purpose of the second study
is to determine how quickly the retrieval quality (precision-recall)
diminishes as the file grows. The rate of hierarchy degeneration is
important because it determines the interval between full or partial
re-clustering of the document collection.

The experimental approach is to divide the document collection into
two groups. One part is clustered and the other is used to update this
"original" hierarchy. After updating, the query collection is processed
through the combined collection while recording performance statistics
(PF-RC, P-R, etc.). Since each request has the potential of recovering
all relevant items, performance differences are directly due to the
updating scheme and to the proportion of the collection in the updating
group. If the size of the updating group is held constant, then the
relative value of the profile maintenance procedures can be studied.
Choosing a single maintenance procedure and varying the size of the
updating group shows how the quality of the hierarchy changes with the
addition of new items. The test results show the superiority of using
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2. Method 

The following experiments use two partitions of the Cranfield 1400
documents: one separating documents into two equally sized sets referred
to as A and B, and another which divides B into the halves C and D. The
result is four collections for testing file maintenance procedures:

    Clustered Collection          Update Collection
    Set       % of Total          Set       % of Total

    A∪B         100%              none          0%
    A∪C          75%              D            25%
    A            50%              B            50%
    D            25%              A∪C          75%

The partitions deliberately maximize the number of queries affected by
updating. For example, if 75% of the total Cranfield collection is to be
clustered (set A∪C), then to the extent possible, 75% of the relevant
documents for each request are placed in that set, leaving 25% of the
relevant documents for each request in the updating set. Consequently,
searches take full advantage of the large number of queries and reliably
show the consequences of file updating. In particular, the variance of
behavior among the queries is greatly reduced.

The collection partitioning algorithm is a manual procedure for
dividing a set of documents into two non-overlapping sub-sets, each
containing half the relevant documents for each query. Initially, the
scheme is applied to all 1400 documents to generate sets A and B and then
to set B to obtain subsets C and D. Actually, the algorithm considers only
relevant documents and splits the non-relevant items afterwards. To
control overall characteristics, queries are processed in order of
decreasing number of relevant documents. Given a specific query, counts
are made to determine how many of its relevant documents are already
assigned to each partition set. The remaining relevant documents are
assigned so that

a) for the entire query, all relevant documents are split evenly
into each partition set;

b) relevant documents with consecutive numbers are not
assigned to the same set; and

c) the cumulative number of items in each partition set
is approximately the same.

Condition b is required since the Cranfield collection is arranged in
subject order (somewhat). Condition c takes care of requests with an odd
number of relevant documents. At times, previous assignments must be
reworked to accommodate new queries; the frequency of these "backtracking"
instances is reduced because of the order of query processing. In the
partitions made for the test experiments, about 2% of the total
assignments could not be made by the above criteria without extensive,
prohibitive backtracking. Even though "erroneous" assignments are made in
these cases, the collections are sufficiently accurate for their intended
purpose. Appendix B lists the members of sets A, C, and D.
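
Conditions a to c can be approximated by a simple greedy pass. The sketch below is an assumption-laden illustration rather than the manual procedure actually used: it applies the conditions in a fixed priority (a, then b, then c), performs no backtracking, and uses hypothetical names and data throughout.

```python
def partition_relevant(relevant_by_query):
    """Greedily split relevant documents into two sets, labelled 0 and 1.

    relevant_by_query maps a query id to the list of its relevant
    document numbers. Returns a dict: document number -> 0 or 1.
    """
    assignment = {}   # document number -> partition set
    totals = [0, 0]   # cumulative number of documents in each set
    # Process queries in order of decreasing number of relevant documents.
    queries = sorted(relevant_by_query,
                     key=lambda q: len(relevant_by_query[q]), reverse=True)
    for query in queries:
        docs = sorted(relevant_by_query[query])
        # Count this query's relevant documents already placed in each set.
        counts = [0, 0]
        for d in docs:
            if d in assignment:
                counts[assignment[d]] += 1
        for d in docs:
            if d in assignment:
                continue
            if counts[0] != counts[1]:
                # Condition a: even out this query's split.
                target = 0 if counts[0] < counts[1] else 1
            elif d - 1 in assignment:
                # Condition b: avoid the set holding the preceding number.
                target = 1 - assignment[d - 1]
            else:
                # Condition c: keep cumulative set sizes about equal.
                target = 0 if totals[0] <= totals[1] else 1
            assignment[d] = target
            counts[target] += 1
            totals[target] += 1
    return assignment


# Hypothetical relevance judgments for three queries.
judgments = {"q1": [1, 2, 3, 4], "q2": [3, 4, 5, 6], "q3": [7]}
split = partition_relevant(judgments)
# split == {1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 1, 7: 0}
```

Because there is no backtracking, a later query can force a violation of condition b; the text above notes that the manual procedure reworked earlier assignments in such cases.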
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The classification parameters (Datt ola's clustering algorithm) for 
these experiments are designed to produce clusters similar to those of
Hierarchy 1 in the previous chapter. A reasonably constant cluster size
is maintained to provide comparable data collections and to simulate
adherence to an operational guideline of holding to an optimal cluster
size (if an optimum were actually known). These particular parameters
yield clusters of moderate size and overlap and are considered suitable
for larger files. Hierarchy 1 represents a collection without updating;
its performance statistics are those which would be obtained if an updated
collection were freshly clustered using the chosen parameters. The
hierarchies generated from the other sets are subjected to various amounts
of updating. In both cases nodes are characterized by two types of
profiles: weighted P*(δ = -1) vectors for weighted updating and
unweighted P*(δ = -1) vectors for unweighted updating. Shortened profiles
are selected because of their smaller storage requirements. Otherwise the
profiles are those giving the best performance for weighted and unweighted
vectors. Table VI-1 compares the properties of the four clustered
collections before updating occurs. In general, the goal of having
hierarchies with similar characteristics is achieved, allowing for the
sizes of the collections involved. The most unfortunate difference is the
amount of overlap, being considerably higher for Hierarchy 4.

                                             Hierarchy
Property                              1        4        5        6

Cluster set                          A∪B      A∪C       A        D
Percent of original Cranfield       100%      75%      50%      25%
Collection size                     1400     1050      700      350

Level 1 (Profiles)
  Number of nodes                     13        9        6        3
  Average crown                      115      133      128      118
  Average profile length             141      149      148      150
  Average number of sons               4        4        4        4

Level 2 (Profiles)
  Number of nodes                     55       36       24       12
  Average crown                       27       38       34       30
  Average profile length              70       86       82       83

Level 3 (Documents)
  Number of nodes                   1500     1367
  Overlap                             7%      30%      17%

Properties of the Original Clustered Collections Before Updating

Table VI-1

The update procedure for each document begins by determining its
update path, that is, the best matching node on the first hierarchy level
and the best matching node among its sons. As before, the cosine function
is used for matching. If no profile alterations are involved, the new
document is simply associated with the selected nodes as described in
Chapter III. If profiles are modified or re-constructed, there are several
options to exercise. In occasional updating or in real-time operation,
profiles are modified each time a document is processed. If used in these
experiments, profile changes and search results would depend on the order
in which documents are added. Instead, a batch update mode is considered
in which the update paths are determined for all new items and all
profiles are altered afterward. Because each profile is changed only once,
the final hierarchy configuration is independent of the order in which
documents are added. The extent to which these experiments predict
behavior over many smaller batch updates is unknown. A few documents
would probably join different clusters; however, the number of such
instances is expected to be small. If profiles are completely re-made
after updating, processing simply follows the rules laid down in Chapter
V. That is, the document vectors in a node's crown are appropriately
combined and perhaps re-weighted, and a term deletion cutoff is applied.
Such vectors represent the best profiles that can be constructed for the
updated hierarchy. In general, new profiles are longer than their previous
versions and cause fragmentation of disk storage since the new vectors
cannot exactly overwrite their predecessors. In the case of weighted
profiles, there is another file maintenance option which alters the
weights of only the existing profile terms. Since no new terms are added,
the vectors maintain their original length and can overwrite their
predecessors.
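
The update-path step and the batch mode can be sketched as follows. This is a minimal illustration assuming sparse vectors stored as term-to-weight dictionaries; the function names, the two-level data layout, and the sample terms are hypothetical and do not reproduce SMART's internal structures.

```python
import math


def cosine(a, b):
    """Cosine correlation between two sparse vectors (dict: term -> weight)."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def update_path(doc, level1, sons):
    """Best-matching level-1 node, then the best-matching node among its sons.

    level1: dict node -> profile; sons: dict node -> {son -> profile}.
    """
    top = max(level1, key=lambda n: cosine(doc, level1[n]))
    son = max(sons[top], key=lambda s: cosine(doc, sons[top][s]))
    return top, son


def batch_update_paths(new_docs, level1, sons):
    """Batch mode: all paths are found against the unaltered profiles, so
    the result does not depend on the order in which documents arrive."""
    return {doc_id: update_path(vec, level1, sons)
            for doc_id, vec in new_docs.items()}


# Hypothetical two-level hierarchy and two new documents.
level1 = {"A": {"wing": 2, "flow": 1}, "B": {"heat": 3}}
sons = {"A": {"A1": {"wing": 1}, "A2": {"flow": 1}}, "B": {"B1": {"heat": 1}}}
new_docs = {"d1": {"wing": 1, "lift": 1}, "d2": {"heat": 2, "flow": 1}}
paths = batch_update_paths(new_docs, level1, sons)
# paths == {"d1": ("A", "A1"), "d2": ("B", "B1")}
```

Profiles would be altered only after all paths are collected, which is what makes the final configuration independent of document order.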



Specifically, consider a set of update documents U = {D_1, D_2, ..., D_n}
for a node whose profile is P*(δ = -1). The updated profile is

\[ H^{*}(\delta = -1) = P^{*}(\delta = -1) \oplus \sum_{D_j \in U} D_j \tag{VI-1} \]

The operator ⊕ denotes normal component-wise vector addition, but
limited to only non-zero elements of the left-hand operand. The resulting
profile is a hybrid in that it combines the weights of a P* vector (based
on frequency ranks) with the term weights of Σ D_j (summed frequency
counts). There is no easy way to remove the hybrid weighting property
without producing a completely new vector. On one hand, Σ D_j could be
converted to use rank weighting and then added to the original profile
using the ⊕ operation. The result is still not a true P* vector. Instead
of a partial solution such as this, the complete hybrid is used in this
study. Some consequences of this choice are discussed in Section VI.3.
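
The restricted addition in equation VI-1 can be illustrated with sparse vectors. This is a sketch under the assumption that a profile is a term-to-weight dictionary holding only its non-zero terms; the function name and the sample weights are hypothetical.

```python
def hybrid_update(profile, update_docs):
    """ALTER-style update: add the summed update-document weights to terms
    already present (non-zero) in the profile. No new terms are added, so
    the vector keeps its original length."""
    summed = {}
    for doc in update_docs:
        for term, weight in doc.items():
            summed[term] = summed.get(term, 0) + weight
    return {term: weight + summed.get(term, 0)
            for term, weight in profile.items() if weight != 0}


# Hypothetical rank-weighted profile and frequency-count update documents.
profile = {"wing": 3, "flow": 1}
updates = [{"wing": 2, "lift": 1}, {"flow": 1, "wing": 1}]
hybrid = hybrid_update(profile, updates)
# hybrid == {"wing": 6, "flow": 2}; "lift" is ignored because the
# original profile does not contain it.
```

The result mixes rank-based weights (from the profile) with summed frequency counts (from the updates), which is the hybrid property discussed above.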

Because the experiments are conducted with weighted profiles first
and then with unweighted profiles, there are two different versions of the
updated collections. Table VI-2 compares the properties of the two final,
updated collections and shows that

a) the average new profile length differs little
in the two collections;

b) there are slightly fewer relevant nodes when weighted
profiles are involved; and

c) there is 62%-76% agreement on the first node of the
update path and 47%-60% agreement on the complete path.

Comparing collections before and after updating reveals significantly
longer profiles (new) and a lower percentage of overlap. The reduction
in overlap is due to the fact that new documents are associated with only
one node on each level. A fact not shown in the table is that the amount
of increase in cluster size varies considerably about its expected value
for each hierarchy. In other words, the update scheme causes some clusters
to receive many additions, some to experience moderate growth, and others
to have only a few changes. However, the amount of growth varies more
widely if unweighted profiles are involved. In all cases, increases in
cluster size are not skewed toward many large increases and a few small
increases or vice versa. Instead the distribution appears uniform, but
with a large standard deviation. Uniform growth is not completely
unexpected because of the way the collection is partitioned, but it is not
a direct consequence of the partition either.

                                             Hierarchy
Property                              1        4        5        6

Update set                          none       D        B       A∪C
Percent of original Cranfield         0%      25%      50%      75%

Level 1 (Profiles)
  Number of nodes                     13        9        6        3
  Average crown                      115      172      245      468
  Weighted updating
    Average new profile length       141      158      185      245
    Average relevant nodes           3.9      3.5      2.8      2.0
  Unweighted updating
    Average new profile length       141      156      185
    Average relevant nodes           3.9      3.5      2.9      2.1
  Agreement in update path                    62%      71%      76%
  (first node only)

Level 2 (Profiles)
  Number of nodes                     55       36       24       12
  Average crown                       27       48       63      117
  Weighted updating
    Average new profile length        70       86      105      118
    Average relevant nodes           5.3      5.3      4.4      3.4
  Unweighted updating
    Average new profile length        70       89       99      127
    Average relevant nodes           5.3      5.4      4.6      3.5
  Agreement in update path                    47%      52%      60%
  (both nodes)

Level 3 (Documents)
  Number of nodes                   1500     1727     1519     1410
  Overlap                             7%      23%                1%

Properties of the Updated Collections

Table VI-2

3. Profile Maintenance Procedures

As new documents are added to an existing hierarchy, the nature of
clusters changes also. That is, they contain different information or
heavier concentrations of older information, etc. A logical move, then,
is to alter cluster profiles to reflect this change in character. The
purpose of the experiments in this section is to determine which profile
maintenance procedure is most beneficial. Hence, the constant quantities
are the amount of file updating and the assignments of particular new
documents to clusters within either the weighted (WTD) or unweighted
(UNWTD) profile hierarchies. The variable quantity is the method of
profile alteration itself; several schemes are discussed in Sections III.5
and VI.1 and are denoted as follows:

NEW - make completely new profile vectors, adding new terms,
re-weighting, etc.;
ALTER - alter weights of existing profile terms only (i.e.,
produce hybrid vectors); and
NONE - no change to the existing profile terms.

In all cases, new documents are properly linked to the hierarchy so that
all items are retrievable during searches. Figures VI-1 to VI-5 report
the precision floor and recall ceiling statistics for searches using
these profile maintenance procedures in the various hierarchies. Only
level 1 of Hierarchy 6 is omitted because it contains only two independent
data points. On each level, the results are amazingly consistent.
Obviously updating a collection with weighted profiles is superior to
updating a collection with unweighted profiles, just as predicted by
the distribution of relevant clusters (Table VI-2). This finding closes
the case against the use of unweighted profiles in almost any
circumstances. Here, the unweighted updated collections perform poorly
regardless of whether the old profiles or completely new unweighted
profiles are used.
Considering just the weighted profiles, the new (NEW) and hybrid
(ALTER) vectors provide nearly equivalent performance even though the new
vectors are often substantially longer. In most cases the additional
terms are those just below the δ = -1 cutoff in the original profiles
before any updating occurs. Hence, the terms have only a marginal effect
on performance. It is quite advantageous that the hybrid and new vectors
are nearly equivalent since

a) NEW vectors represent the best reasonable profiles that
can be made for an updated collection and

b) ALTER profiles maintain their original lengths and
reduce fragmentation of disk space since they can
overwrite their predecessors.

Consequently, use of the hybrid (ALTER) profiles is preferable in actual
retrieval systems.

Maintenance
Procedure      Profile Type           Average Length
NEW            WTD P*(δ = -1)              158
ALTER          WTD P*(δ = -1)              149
NONE           WTD P*(δ = -1)              149
NEW            UNWTD P*(δ = -1)            156
NONE           UNWTD P*(δ = -1)            149

Comparison of Profile Maintenance Procedures
Hierarchy 4, Level 1, 25% Updating
Figure VI-1

Maintenance
Procedure      Profile Type           Average Length
NEW            WTD P*(δ = -1)              185
ALTER          WTD P*(δ = -1)              148
NONE           WTD P*(δ = -1)              148
NEW            UNWTD P*(δ = -1)            185
NONE           UNWTD P*(δ = -1)            148

Comparison of Profile Maintenance Procedures
Hierarchy 5, Level 1, 50% Updating
Figure VI-2

Maintenance
Procedure      Profile Type           Average Length
NEW            WTD P*(δ = -1)               86
ALTER          WTD P*(δ = -1)               86
NONE           WTD P*(δ = -1)               86
NEW            UNWTD P*(δ = -1)             89
NONE           UNWTD P*(δ = -1)             86

Comparison of Profile Maintenance Procedures
Hierarchy 4, Level 2, 25% Updating
Figure VI-3

Maintenance
Procedure      Profile Type           Average Length
NEW            WTD P*(δ = -1)              105
ALTER          WTD P*(δ = -1)               82
NONE           WTD P*(δ = -1)               82
NEW            UNWTD P*(δ = -1)             99
NONE           UNWTD P*(δ = -1)             82

Comparison of Profile Maintenance Procedures
Hierarchy 5, Level 2, 50% Updating
Figure VI-4

Maintenance
Procedure      Profile Type           Average Length
NEW            WTD P*(δ = -1)              118
ALTER          WTD P*(δ = -1)               83
NONE           WTD P*(δ = -1)               83
NEW            UNWTD P*(δ = -1)            127
NONE           UNWTD P*(δ = -1)             83

Comparison of Profile Maintenance Procedures
Hierarchy 6, Level 2, 75% Updating
Figure VI-5

Surprisingly, hybrid vectors do not seem to experience correlation
domination. Since their term weights are based partially on frequency
ranks (from the original profile before updating) and partially on
frequency counts (summed over the new documents), some domination could
occur when a node experiences a large number of additions. Under these
conditions, some terms of its profile have their weights considerably
increased. However, as suggested in Section V.4, domination involves
only a few terms with very high weights, and here the use of ranks in the
original profile may reduce weights enough so that domination is not
observed. A less likely explanation holds that all terms have their
weights increased in proportion to their original values so that
contribution ratios remain roughly constant throughout the collection.

Weighted profiles which remain unaltered after updating (NONE option)
perform slightly less well than NEW profiles for modest amounts of
updating (25%) and less well otherwise. This demonstrates the suitability
of an update procedure which makes no changes to the profile terms at all.
The effect of no alteration is an important consideration in the use of
partially weighted profiles, for example (see Section V.7), in which
term weights cannot be altered without destroying the entire vector. In
all cases, updating requires some changes to one or more profiles in
order to link new documents to upper hierarchy levels. Since these vectors
must be re-written anyway, there is little extra saving from not
re-weighting terms on at least the lowest level where at all possible. As
mentioned above, an exception to this is partially weighted profiles.
Finally, because weighted profiles maintained under the NEW and NONE
options do not differ widely, it is apparent that profiles do not need a
great ability to move about their clustered document subspace in order to
characterize new items. Some movement seems advisable, but not a great
deal is required. This may be related to the fact that the partitioning
and update schemes result in more or less uniformly distributed increases
in cluster size. Even so, the assumption of random subject acquisition
over the entire collection still renders the present results applicable
to practical situations since bulk additions in a single subject area
can be viewed as random acquisition in a subtree of the original hierarchy.
A summary of the findings in these experiments includes the following.

a) Weighted profiles are superior to unweighted profiles
with respect to updating, in that they result in fewer
relevant clusters and earlier searching of these clusters.

b) Considering just the use of weighted profiles, there is
little difference among the maintenance options NEW,
ALTER, and NONE. The last option (no changes to profiles)
is slightly inferior for a large number of additions to
the file.

c) The simplest and most reasonable update procedure is that
of making hybrid profiles (ALTER option). That is, the
weights of existing profile terms are adjusted by equation
VI-1, but no terms are added. In addition to providing
good performance, these vectors retain their original size.

The experimental results presented here are given in terms of RC-PF
performance curves. SMART precision-recall plots for 18 narrow and broad
searches using these profiles may be found in Figures VI-6 to VI-12 in
the next section. These curves are not repeated here because they provide
no information that changes the above conclusions.

Recent work by Kerchner (3) in this area substantiates these findings.
This work uses a different hierarchy, employs slightly different update
procedures, considers a 50% update fraction, and uses a single search
strategy with one iteration of relevance feedback.

4, Degeneration of the Hierarchy 

A document classification bases its groupings on the data available 
at a single moment of time. Afterward, new items are blended into the 
existing structure. In general, updating causes a reduction of search 
speed since new documents may be stored in overflow areas away from the 
rest of their cluster. Indexed sequential access schemes use a number of
techniques to handle overflow records, such as storing them in the same
cylinder or pack; writing blocked or unblocked records; and using
dynamic cylinder reorganization (1, 2). Since SMART simulates cluster
searches, it is impossible to measure the exact increase in search time
due to updating. A rough approximation suggests that a full disk rotation
might occur between inputs of document records (unblocked) in the
overflow area. In any case, search speed can be increased simply by
re-writing the file in correct physical sequence. Although there is
expense involved in this solution, it involves only data movement and not
a structural re-organization of the file.

In addition, new documents subject a hierarchy to a subtle form of
degeneration that ultimately necessitates partial or full re-clustering.
The problem stems from the fact that new documents alter the character
of the collection and individual clusters so that the original
classification loses its value. Regardless of whether updating includes
alterations to profiles, clusters may become polarized or very similar in
content (see Figure III-5). As a result, users receive poorer output from
searches because profiles no longer accurately represent all the documents
in their clusters and because the classification is no longer a
"logical" partition of the collection. The solution to this problem is
some form of re-clustering. The experiments in this section attempt to
determine how frequently re-clustering must occur as a function of file
growth. Consequently, a consistent profile maintenance scheme is used
and the amount of updating varies among the tests.

Evaluation of these experiments is extremely difficult because they 
involve hierarchies with large differences in cluster sizes. Ideally, 
the same search strategy is used throughout, the amount of search effort 
is constant, and the amount of hierarchy degeneration is observed in the 
precision-recall curves. With small deviations, these conditions are met 
in the previous tests because comparisons are made within the same hierarchy. 
In the present tests, neither the strategy nor the search effort can be held
constant even though the file is always assumed to be freshly re-sequenced
so that maximum search speed is attained (no items reside in overflow
areas). The difficulty in observing the degeneration arises from the
following conditions:

a) the larger the percentage of updating, the larger the
final cluster sizes;

b) since the same number of clusters are expanded in all
tests, differences in cluster sizes among the hierarchies
make it impossible to keep the number of document
correlations relatively constant in all cases; and

c) both precision and recall generally increase as the
number of document correlations increases.

The circular nature of these conditions implies that a performance
improvement might be observed with an increased amount of updating simply
because more documents are examined. This is somewhat true if a constant
search strategy is maintained, since only large amounts of hierarchy
degeneration would show in P-R curves. Consequently, there are two
schemes for observing the desired effect: (1) to maintain a constant
search strategy and judge degeneration from the P-R differences and the
number of profile and document correlations and (2) to alter the search
strategy among runs to equalize the number of document comparisons before
judging degeneration from P-R differences. The second method equalizes
the number of comparisons, and consequently does not measure degeneration
using exactly the same search procedure in all cases.

In the following experiments, various fractions of the Cranfield
collection are used in updating in conjunction with each of the three
maintenance schemes for weighted profiles (NEW, ALTER, NONE). The original
profiles, before updating, are of the P*(6 = -1) type; afterwards the
vectors are either of the same type (NEW, NONE) or hybrids (ALTER). For
each maintenance scheme two complete searches are made using the narrow
and broad SMART search strategies outlined in Chapter IV. The resulting
P-R curves allow evaluation by method 1. For example, Figures VI-6, 8, 10
show the P-R plots, normalized measures, and number of correlations per
level, C(x), for four narrow searches on document collections subjected
to various amounts of updating. In all tests, the same search strategy is
used. Figures VI-7, 9, 11 show similar data for a broad search.

Especially for the narrow strategy, performance drops off steadily as the
amount of updating increases, even though the latter searches are helped
in that they examine more documents (120-150). In order to employ
evaluation method 2, the same number of document correlations must be
performed in all cases. Fortunately the narrow searches for 50% and 75%
updating and the broad searches for 0% and 25% updating examine about the
same number of documents, so it is possible to observe the performance loss
from updating in this way also. (The applicable curves are joined by dashed
lines in pairs of figures: VI-6 and 7, VI-8 and 9, VI-10 and 11.) Both
evaluation schemes suggest there is degeneration of the hierarchy, but
their estimates differ moderately.

The precision-recall curves for the complete set of searches under all
three profile maintenance options are contained in Figures VI-6 to VI-11.
As always, a number of comparisons and observations can be made. First,
consider the curves for 0% and 25% updating. In all broad searches the
number of document correlations is approximately the same, and there is
virtually no performance difference with this percentage of file growth.
For narrow searches, about 20 more document correlations are involved in
the update searches and their P-R curves are better, as expected.
Still, the collection seems to accommodate at least 25% updating with
little or no performance loss. Second, in nearly all cases the curves for
50% and 75% updating lie substantially below the search simulating a
freshly re-clustered collection (0% update). The P-R differences are
generally smaller in the broader searches, but the corresponding increase
in the number of document correlations to achieve that performance
assures that significant degeneration has occurred with this many new
additions. This observation is substantiated further by comparing
performance between the paired narrow and broad searches mentioned earlier
(evaluation method 2). Roughly speaking, the normalized measures drop
about [value illegible] for 50% updating and 8% for 75% updating, and the
P-R curves remain far apart. Third, all results are somewhat invariant with
the profile maintenance procedure. This emphasizes the findings in the
previous section; namely, that all profile maintenance procedures are
roughly equivalent with some preference, perhaps, for hybrid vectors.

The conclusion of these experiments is that enough hierarchy decay
occurs with 25%-50% updating to warrant re-clustering. The "break even"
point is probably on the low side of this range. The implications for
partial re-clustering are obvious. To review, under this procedure each
profile includes the number of documents in its current crown and the
number of additions since the latest classification (update count).
Consequently, whenever the ratio of update count to crown size exceeds
0.25-0.50, all items beneath the node are re-clustered. An alternative
to this scheme is to re-cluster all documents in the node's filial set
as well.
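
The partial re-clustering rule can be illustrated with a short sketch. It
is not the SMART implementation: the Node class, its field names, and the
default threshold of 0.25 (the low end of the 0.25-0.50 range) are
assumptions made only for this example. The sketch flags the highest nodes
whose ratio of update count to crown size exceeds the threshold.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A profile node; the field names are illustrative assumptions."""
    name: str
    crown_size: int          # documents beneath the node at the last classification
    update_count: int = 0    # documents added since the last classification
    sons: list = field(default_factory=list)


def needs_reclustering(node, threshold=0.25):
    """True when update_count / crown_size exceeds the threshold."""
    if node.crown_size == 0:
        return node.update_count > 0
    return node.update_count / node.crown_size > threshold


def nodes_to_recluster(root, threshold=0.25):
    """Return the names of the highest nodes whose ratio exceeds the threshold.

    Descendants of a flagged node are not examined, since re-clustering the
    flagged node rebuilds everything beneath it.
    """
    if needs_reclustering(root, threshold):
        return [root.name]
    flagged = []
    for son in root.sons:
        flagged.extend(nodes_to_recluster(son, threshold))
    return flagged
```

With a root holding 200 documents and 40 additions, and two sons holding
100 documents with 30 and 10 additions respectively, only the first son is
flagged at a threshold of 0.25, while a threshold of 0.15 flags the root
and therefore the whole collection.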






5. Summary 

This chapter considers the problem of updating a clustered file and
two very important questions: (1) how profiles should be altered to
more accurately represent both new and old documents and (2) how
frequently the collection (or node) must be re-clustered in order to
recover from the performance loss inherent in the classification-update
process. The results from both sets of experiments contain the following
information.

a) Weighted profiles have updating characteristics superior
to those of unweighted profiles in that they result in
fewer relevant clusters which are expanded earlier in
searching.

b) A clustered collection can tolerate 25%-50% updating
before partial or total re-clustering is necessary.

c) All profile maintenance procedures tested (NEW, ALTER,
NONE) performed about equivalently. In particular, new
terms do not need to be added to profiles even if the
original vectors have undergone earlier term deletion.
The recommended scheme is that of forming hybrid vectors
as a result of updating (ALTER option). This scheme
allows changes to weights of existing profile terms, but
keeps the original profile length so the updated vector
may overwrite its predecessor.

These conclusions are based on test procedures which use a deliberate
partition of the collection and somewhat unusual ways of comparing
precision-recall curves. It is hoped that neither of these techniques
unfairly biases the results.

Chapter VII

Experiments with Hierarchy Storage
1. Introduction

As stated in Chapter II, the physical file organization of interest
is the indexed sequential access method (ISAM) as opposed to the logical
file organization which, of course, is based on clustering. The experiments
in this chapter have a dual purpose related to physical organization:
(1) to determine the I/O delays while searching a clustered file
and (2) to compare the appropriateness of storing profiles in level and
heir-filial order. I/O delays are measured by the number of disk accesses
for obtaining index information and actual data items. The previous
experiments measure search effort by the number of expanded clusters or
correlations per hierarchy level. This is acceptable since the purpose
of the tests is to maximize performance (PF-RC, P-R, etc.) for a given
amount of work rather than to study search effort itself. Furthermore,
specifying the parameter settings for actual file storage would have made
previous conclusions less general than desired. This chapter considers
the problem of minimizing search I/O time while maintaining a given
performance level. The experimental procedure is to determine disk locations
for all documents and profiles using the storage algorithm in Section III.6
and to simulate query processing while monitoring track and cylinder
positions. Changing the order of item storage allows comparison of the
level and heir-filial sequences. Processing is handled so that the
searches resemble those made earlier with SMART using Hierarchy 1 and
P*(6 = -1) profiles. Consequently, the precision-recall performance
of the earlier tests can be associated with the amount of disk I/O
determined here.

2. Procedure

The following experiments are designed to monitor the I/O activity
while processing queries through a clustered file. Actual disk storage
and access is not involved. Instead, a disk is simulated by constructing
a storage map of "cylinder and track locations" for data records, ISAM
indexes, and overflow areas. Data from previous searches and other
sources is used to construct the map and then employ it in such a way
as to accurately approximate an actual disk search. The general
organization of an ISAM file is outlined in Section II.3. This section sets
forth the specific parameters used in the simulation and in the record
storage algorithm as well as the chosen evaluation measures. It also
discusses a number of difficult problems associated with physical file
organization.

Disk space management is simulated according to the ISAM philosophy
using the parameters and data characteristics in Table VII-1. Eighteen
tracks of each cylinder are allocated to profile, document, and index
storage while the remaining tracks are reserved for future documents
(overflow areas). Each cylinder of data is preceded by a track index and the
first record of the entire file is the cylinder index (each index size
is 300 bytes). The profile sizes are those of P*(6 = -1) vectors for
Hierarchy 1. In the SMART implementation, each index term is stored as
a weighted concept number (term identifier) which requires 4 bytes of
storage; other storage requirements per document total 96 bytes for header
information, citation, etc. The same memory requirements are used here,
resulting in the average record sizes shown in the table.

In making the map of record locations on the simulated disk, each
track is treated as one physical record and data items are packed onto
it using the storage algorithm in Section III.6. This algorithm attempts
to balance the amount of wasted disk space against the frequency of storing
filial records on more tracks than necessary (thereby requiring extra
accesses during retrieval). In cases where it makes a difference, space
is traded for access time only if the amount of waste is beneath a chosen
threshold (θ bytes per track). Consequently, in actual tests, the pertinent
evaluation data is the percent of wasted data space and the percent
of filial record sets requiring extra accesses. Table VII-2 shows the
results of applying this storage procedure to the profiles for Hierarchy 1.
Data is given for the case of sequencing items by level as well as in
heir-filial order, although there is little difference in the outcomes.
In both cases, ISAM indexes are stored in their appropriate locations and
the threshold for waste is 10% of the track size (θ = 720). Therefore
when less than 720 bytes remain on the current track, the next filial set
of records begins on a new track. This is a rather infrequent situation
here since the size of a typical filial set is large relative to θ. In
actuality less than 1% of the file space is wasted even though up to 10%
is allowed. About 30% of the filial sets are split across an unnecessarily
large number of tracks, thereby requiring extra accesses in retrieval.
Whereas this percentage depends heavily on θ, simulations show it is
practically independent of the distribution of filial set size, once this
size exceeds track capacity. Overall, the document collection and its 68
profiles require 69 tracks, of which 65 are allocated to documents.

Number of disk tracks/cylinder                       20
Tracks/cylinder for data and ISAM indexes            18
Tracks/cylinder for overflow (10%)                    2
Track size                                         7294 bytes
ISAM index size (track or cylinder)                 300 bytes
Average record size
   level 1 - profiles (141 terms)                   660 bytes
   level 2 - profiles (70 terms)                    376 bytes
   level 3 - documents (54 terms)                   312 bytes

Parameters Related to Management of Disk Storage

Table VII-1
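
The record sizes in Table VII-1 and the threshold rule described above can
be checked with a small sketch. It is not the storage algorithm of
Section III.6, which is not reproduced here. The function names are
hypothetical, and the packing routine makes the simplifying assumption that
a filial set is one contiguous run of bytes that may spill onto the
following tracks.

```python
import math

BYTES_PER_TERM = 4      # one weighted concept number
HEADER_BYTES = 96       # header information, citation, etc.
TRACK_SIZE = 7294       # bytes per track (Table VII-1)


def record_size(n_terms):
    """Average record size in bytes for a vector with n_terms terms."""
    return BYTES_PER_TERM * n_terms + HEADER_BYTES


def pack_filial_sets(set_sizes, track_size=TRACK_SIZE, threshold=720):
    """Place filial sets on tracks; return (placements, wasted_bytes).

    Each placement is a (first_track, last_track) pair. If fewer than
    `threshold` bytes remain on the current track, that remainder is left
    unused and the next filial set starts on a fresh track.
    """
    track, used, wasted = 0, 0, 0
    placements = []
    for size in set_sizes:
        free = track_size - used
        if used > 0 and free < threshold:
            wasted += free
            track, used = track + 1, 0
        first = track
        remaining = size
        while remaining > 0:
            chunk = min(remaining, track_size - used)
            used += chunk
            remaining -= chunk
            if used == track_size and remaining > 0:
                track, used = track + 1, 0
        placements.append((first, track))
        if used == track_size:
            track, used = track + 1, 0
    return placements, wasted


def needs_extra_access(size, placement, track_size=TRACK_SIZE):
    """True if the set occupies more tracks than its size strictly requires."""
    first, last = placement
    return (last - first + 1) > max(1, math.ceil(size / track_size))
```

Under these assumptions record_size(141), record_size(70), and
record_size(54) give 660, 376, and 312 bytes, the three averages in the
table. Packing sets of 7000 and 1000 bytes with a threshold of 720 wastes
the 294 bytes left on the first track, while a threshold of 0 wastes
nothing but splits the second set across two tracks. This is the
space-time trade-off that the threshold controls.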






Property                                  Store Items    Heir-Filial
                                           by Level        Order

Threshold for wasted space (bytes)            720            720

Level 1, profiles (13)
   Size of average filial set (bytes)        8580           8580
   Total storage (tracks)                       1              1

Level 2, profiles (55)
   Size of average filial set (bytes)        1616           1616
   Total storage (tracks)                       3              3

Level 3, documents (1500)
   Size of average filial set (bytes)        8480           8480
   Total storage (tracks)                      65             65

Percent of disk space wasted                 0.72%          0.60%

Percent of filial sets requiring
an extra access (100% = 69)                   30%        [illegible]

Hierarchy Storage in Level and Heir-Filial Order

Table VII-2

File management requires considering record retrieval as well as
storage. For a clustered file accessed through ISAM, most of the retrieval
options concern the handling of indexes. The conventions adopted here
retain the cylinder index in core memory, once it is accessed, as well as
the track index for the current cylinder (only). Each query is processed
independently of others in the following manner. Initially the disk arm
(set of read/write heads) is positioned just outside the clustered file,
as if serving another program. The search begins by obtaining the
cylinder index, first track index, and all level 1 profiles. Thereafter,
profiles and documents are accessed as nodes are expanded; new indices
are fetched as cylinder boundaries are crossed. Only one optimization
technique is employed. Because several nodes may be expanded
simultaneously, it is possible to know, in advance, the next few desired
records. When this is the case, it is assumed that records are obtained in a
single sweep across the disk surface rather than by jumping forward and
backward.

Several measures of I/O activity are used, all being related to track
and cylinder changes; the term access refers to either type of change.
Actual timings or conversion of accesses to milliseconds are not made,
since they can be misleading. For example, real time I/O delays are a
function of the traffic volume in a computing system and are therefore
quite variable, particularly for the IBM 2314 (1). Moreover, delays vary
with file size since a track change in a small file may correspond to a
cylinder change in a big file. In this research the average number of
disk accesses per query search is used as a measure of I/O activity, being
reasonably accurate and easily converted to milliseconds in specific
computing environments. The averages are subdivided into the number of
track and cylinder changes per hierarchy level and the number of tracks
and cylinders traversed in each change (step size). This breakdown helps
relate the results to files of different sizes or located on different
storage devices.
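
These measures can be sketched as follows. The function names and the
(cylinder, track) tuple layout are assumptions made for this illustration.
In particular, the sketch assumes that a move to a different cylinder is
counted once as a cylinder change, without a separate track change, and
that a batch of records known in advance is read in sorted order to model
the single sweep across the disk surface.

```python
def io_activity(start, locations):
    """Count track and cylinder changes over a sequence of reads.

    start      -- (cylinder, track) position of the arm before the search
    locations  -- (cylinder, track) of each record in the order it is read
    """
    cyl, trk = start
    stats = {"cylinder_changes": 0, "cylinder_steps": 0,
             "track_changes": 0, "track_steps": 0}
    for next_cyl, next_trk in locations:
        if next_cyl != cyl:
            # Assumption: a cylinder move is not also counted as a track change.
            stats["cylinder_changes"] += 1
            stats["cylinder_steps"] += abs(next_cyl - cyl)
        elif next_trk != trk:
            stats["track_changes"] += 1
            stats["track_steps"] += abs(next_trk - trk)
        cyl, trk = next_cyl, next_trk
    stats["accesses"] = stats["cylinder_changes"] + stats["track_changes"]
    return stats


def single_sweep(locations):
    """Order a batch of known locations so the arm moves in one direction."""
    return sorted(set(locations))
```

The average step size for either kind of change is the step total divided
by the number of changes. Averaging the counts over all queries gives
figures of the kind reported per hierarchy level in the tables of the next
section.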

Search strategy has a major influence on the amount of I/O and
consequently on the optimum storage sequence for a clustered file. Here,
data from previous cluster-oriented evaluation runs are used to simulate
actual searches. First, all level 1 nodes are "accessed" from the simulated
disk. Second, the top ranked nodes, as identified in the cluster-oriented
evaluation, are expanded by accessing their sons. Finally, the level 2
nodes are expanded in the order specified by their cluster-oriented
evaluation ranking and the appropriate documents are accessed. Throughout
this discussion, access implies fetching an item stored on the
simulated disk while monitoring the track and cylinder changes. The
number of expanded nodes is controlled to approximate the SMART narrow
and broad searches used earlier. Thus, it is possible to approximate the
I/O delays connected with the P-R curves for these searches (Figures V-38
and V-39, P*(6 = -1) profiles). Tables VII-3 and VII-4 show how well
the expansion data for these tests (simulated searches) matches the actual
SMART searches. The small discrepancies are due to the fact that SMART
has complex expansion criteria (see Table IV-2) whereas the simulated
search uses only cluster size.

Property                                 Search Strategy
                                         Narrow      Broad

Level 1
   Number of profile correlations         13          13
   Number of expanded nodes               1.7         3.1

Level 2
   Number of profile correlations         8.0        13.0
   Number of expanded nodes               3.0         5.3

Level 3
   Number of document correlations        86         162

Average Expansion Characteristics of SMART Searches
Hierarchy 1, P*(6 = -1) Profiles

Table VII-3


Property                                 Search Strategy
                                         Narrow      Broad

Level 1
   Number of profile correlations         13          13
   Number of expanded nodes           [illegible]     3.0

Level 2
   Number of profile correlations         7.9        13.9
   Number of expanded nodes               2.9         5.6

Level 3
   Number of document correlations        85         15?

Average Expansion Characteristics of Simulated Searches
Using the Cranfield Query Set

Table VII-4

The extent to which the 225 Cranfield requests evenly cover Hierarchy
1 is unknown. That is, some clusters may be examined a disproportionate
number of times. To alleviate deviations from this source, a "simulated
query set" is also used in these tests, although no queries are actually
involved. Instead, for 598 trials, random nodes are accessed on each
level while track and cylinder changes are recorded. To be realistic,
all level 1 profiles are accessed and randomly selected nodes are expanded.
Thereafter, selections are limited to the sons of previous nodes. Comparing
the data in Tables VII-3 and VII-5 verifies that the simulated
queries behave like the actual Cranfield requests in that the same number
of nodes are expanded in both searches. Corroborated results from both
query sets give additional confidence to the findings of these
investigations.
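
One trial of the simulated query set might be sketched as below. The
function name, the dictionary that maps each node to its sons, and the
fixed integer numbers of expansions per level are assumptions made for
this illustration. The averages in the tables are fractional, so the
actual procedure evidently varied the number of expansions from trial to
trial.

```python
import random


def simulated_query(level1, sons, n_expand_1, n_expand_2, rng):
    """Run one random trial over a three-level hierarchy.

    level1 -- list of all level 1 node names (all are read)
    sons   -- dict mapping a node name to the list of its sons
    Returns (level 1 nodes read, level 2 nodes read, documents read).
    """
    # Expand randomly chosen level 1 nodes and read their sons.
    expanded_1 = rng.sample(level1, min(n_expand_1, len(level1)))
    level2 = [s for node in expanded_1 for s in sons[node]]
    # Later selections are limited to the sons of previously expanded nodes.
    expanded_2 = rng.sample(level2, min(n_expand_2, len(level2)))
    documents = [d for node in expanded_2 for d in sons[node]]
    return list(level1), level2, documents
```

Passing a seeded random.Random instance as rng makes a run repeatable.
The items returned for each trial can then be mapped to their disk
locations and their track and cylinder changes recorded.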

Property                                 Search Strategy
                                         Narrow      Broad

Level 1
   Number of profile correlations         13          13
   Number of expanded nodes               1.8         3.0

Level 2
   Number of profile correlations         8.0        13.1
   Number of expanded nodes               3.1         5.9

Level 3
   Number of document correlations        85         164

Average Expansion Characteristics of Simulated Searches
Using Random Selection of Expanded Nodes (Simulated Queries)

Table VII-5

To summarize, the following experiments try to relate I/O activity
to P-R performance levels and to select an optimal storage sequence for
a clustered file. The test procedure depends heavily on the accurate
simulation of the indexed sequential access method. Briefly, the following
steps are involved.

a) Select parameters for record sizes and management of
disk space (Table VII-1).

b) Prepare the disk map using the proposed storage algorithm
and either level or heir-filial item sequence (Table VII-2).

c) Using either real or simulated requests and the expansion
parameters of previous searches (Table VII-3), process
each request while monitoring its simulated I/O activity.
Processing includes obtaining cylinder and track indexes,
profiles on level 1, and the sons of all expanded nodes
thereafter. The primary evaluation measures include a
breakdown of the number of accesses per hierarchy level
(track and cylinder changes) and the average step size
between these changes.

The resulting data is used to evaluate file storage and retrieval options
in the forthcoming sections.

3. Test Results

The previous section outlines the simulation and evaluation procedures
for studying the I/O activity connected with searches of a clustered
file. The results for Hierarchy 1 and P*(6 = -1) profiles are shown
in Tables VII-6 and VII-7. Regarding terminology, the total number of
accesses is the average number of track or cylinder changes per query for
obtaining index, profile, and document data. This is separated into track
and cylinder changes per level; the amount of data moved over between
accesses (average step size) is included also. For example, a cylinder
step size of 1.5 implies that an average of 1.5 cylinders is passed over
each time a change is made. An analogous statement applies to track step
size, but is obviously bounded by the number of data tracks per cylinder
(18).
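
The relation between these quantities can be checked with the figures for
Run A of Table VII-6 (real queries, storage by level, narrow search). The
short sketch below only adds the per-level track and cylinder changes. The
variable names are illustrative.

```python
# Average changes per query for Run A of Table VII-6, keyed by hierarchy level.
cylinder_changes = {1: 1.00, 2: 0.00, 3: 1.24}
track_changes = {1: 1.00, 2: 1.39, 3: 5.10}

# An access is either kind of change, so the per-level totals are sums.
accesses = {
    level: round(cylinder_changes[level] + track_changes[level], 2)
    for level in cylinder_changes
}
total = round(sum(accesses.values()), 2)
```

The sums give 2.00 and 1.39 accesses on levels 1 and 2, 6.34 accesses on
level 3, and 9.73 accesses per query in all, which agrees with the total
printed for Run A.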

Property                             Search Run
                                  A          B           C            D

Query Set                        Real    Simulated      Real      Simulated
Storage Sequence                 Level     Level     Heir-filial  Heir-filial

Total Number of Accesses
   Level 1                       2.00      2.00         2.00         2.00
   Level 2                       1.39      1.42         2.82         3.06
   Level 3                       6.34      6.62         6.24         6.72
   Total                         9.73     10.0         11.0         11.8

Number of Cylinder Changes
   Level 1                       1.00      1.00         1.00         1.00
   Level 2                       0         0            1.23         1.18
   Level 3                       1.24      1.22         1.14         1.36

Step Size of Cylinder Changes
   Level 1                       1.00      1.00         1.00         1.00
   Level 2                       0         0            1.35         1.48
   Level 3                       1.59      1.59         1.52         1.57

Number of Track Changes
   Level 1                       1.00      1.00         1.00         1.00
   Level 2                       1.39      1.42         1.59         1.19
   Level 3                       5.10      5.40         5.09         5.36

Step Size of Track Changes
   Level 1                       1.00      1.00         1.00         1.00
   Level 2                       1.21      1.27         8.50         7.50
   Level 3                       2.89      3.04         2.64         2.84

I/O Activity in a Simulated Narrow Cluster Search

Table VII-6


Property                             Search Run
                                  A          B           C            D

Query Set                        Real    Simulated      Real      Simulated
Storage Sequence                 Level     Level     Heir-filial  Heir-filial

Total Number of Accesses
   Level 1                       2.00      2.00         2.00         2.00
   Level 2                       1.88      1.90         4.36         4.86
   Level 3                      11.5      11.7         12.0         12.5
   Total                        15.4      15.6         18.3         19.4

Number of Cylinder Changes
   Level 1                       1.00      1.00         1.00         1.00
   Level 2                       0         0            1.66         1.74
   Level 3                       1.86      1.85         2.14         2.51

Step Size of Cylinder Changes
   Level 1                       1.00      1.00         1.00         1.00
   Level 2                       0         0            1.23         1.25
   Level 3                       1.08      1.29         1.43         1.48

Number of Track Changes
   Level 1                       1.00      1.00         1.00         1.00
   Level 2                       1.88      1.90         2.70         3.12
   Level 3                       9.64      9.85         9.81        10.0

Step Size of Track Changes
   Level 1                       1.00      1.00         1.00         1.00
   Level 2                       1.24      1.12         7.99         7.09
   Level 3                       2.45      2.62         2.67         2.88

I/O Activity in a Simulated Broad Cluster Search

Table VII-7

By nearly every measure, storing the hierarchy in order by levels
appears more economical for forward search strategies. The difference is
1-2 accesses for the narrow search and 3-4 accesses for the broad search.
The largest contribution to these differences comes from the nodes on level
2, both in the number of track and cylinder changes and in their step
sizes. To review, both storage sequences keep filial record sets
together. In addition, order by levels places all nodes on the same
level in adjacent disk locations. Heir-filial order keeps parent nodes
and their sons somewhat close at the expense of storing structurally
unrelated nodes in separated areas. Consequently, heir-filial order should
be more economical in forward, narrow searches since all data resides in
a localized area. The choice of the better sequence should hinge on the
relative frequency of narrow and broad searches in an actual operating
environment. However, it is doubtful that a significant proportion of
actual searches are narrower than those used here and thus able to take
advantage of the economies of heir-filial order. Most searches involve
expansions of one or more unrelated nodes and thereby benefit from storing
the hierarchy by levels even more than shown here.

If the file were larger than the 1400 Cranfield documents, the
conclusions should be roughly the same even though there are more nodes.
With order by levels, there would be greater distance between parent and
sons (more space for the read heads to travel), while heir-filial order
still confines related data to a localized area. On the other hand, if
heir-filial order were used, structurally unrelated nodes would have even
greater separations and jockeying back and forth between them in a search
would be quite costly. For example, it is likely the large track step sizes
in the tables would turn into steps over cylinders. Figuring both storage
schemes are penalized equally, the choice of optimal order comes down to
the frequency of extremely narrow searches. As mentioned, few searches
are assumed to be as narrow as the one used here, so storage order by
levels probably remains the better choice even for larger collections.

These conclusions hold for the case of forward search strategies.
Backtracking or plunge-first strategies make a better fit with heir-filial
order since it localizes structurally related records. In fact, the SMART
searches use a small amount of backtracking and their I/O requirements
are only approximated by the present forward searches. The actual I/O
is a bit more than the figures quoted in the tables, say 2-3 additional
accesses.

It is instructive to compare these cluster searches with a full
search in terms of their performance (normalized measures) and the number
of correlations and accesses. Such a comparison is given in Table VII-8,
using the full search in Figure IV-5, the cluster searches in Figures
V-38 and V-39, and the I/O data in Tables VII-6 and VII-7. In general,
the comparison shows that cluster searching achieves its primary goal of
recovering a good share of the relevant documents at much less cost than
a full search. The narrow cluster search results in NR-NP values which
are 60%-70% of those for a full search, but does 8%-15% of the work.
A broad cluster search achieves 70%-80% of the full search NR-NP
performance for 13%-24% of the effort. Clearly the quality of performance
increases as the amount of work increases, but with diminishing returns.
Assuming that I/O delays are the dominant factor in determining response
time and that one access is made per second, then the cluster searches
should finish in 10-16 seconds while the full search requires over a
minute. In any case, the actual computer cost and time delay should be
predicted rather accurately by the combination of the number of
correlations (CPU computation) and disk accesses (I/O and system overhead).

                                  Narrow      Broad
Property                          Cluster     Cluster      Full
                                  Search      Search      Search

Normalized Recall                  .63         .70         .88
Normalized Precision               .3?         .44         .61
Recall Ceiling                     .32         .46        1.00

Number of Correlations
   Level 1 - Profiles              13          13           0
   Level 2 - Profiles               8          13           0
   Level 3 - Documents             86         162        1400
   Total                          107         188        1400
   % of Full Search                 8%         13%        100%

Number of Disk Accesses
   Level 1 - Profiles              2.0         2.0          0
   Level 2 - Profiles              1.4         1.9          0
   Level 3 - Documents             6.3        11.5        65.0
   Total                           9.7        15.4        65.0
   % of Full Search                15%         24%        100%

Relation of Performance and I/O Activity for
Cluster and Full Searches

Table VII-8
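
The percentages quoted for Table VII-8 can be re-derived from its entries.
In the sketch below the dictionary layout and the function name are
illustrative. The narrow-search precision is omitted because its second
digit is illegible in the source, and the access totals of 9.7 and 15.4
are taken from Run A of Tables VII-6 and VII-7, an assumption where the
printed entries are unclear.

```python
# Entries from Table VII-8; NR and NP are normalized recall and precision.
full = {"NR": 0.88, "NP": 0.61, "correlations": 1400, "accesses": 65.0}
narrow = {"NR": 0.63, "correlations": 107, "accesses": 9.7}
broad = {"NR": 0.70, "NP": 0.44, "correlations": 188, "accesses": 15.4}


def percent_of_full(search, key):
    """Value of `key` for a cluster search as a whole percent of the full search."""
    return round(100 * search[key] / full[key])


narrow_summary = {key: percent_of_full(narrow, key) for key in narrow}
broad_summary = {key: percent_of_full(broad, key) for key in broad}
```

The narrow search reaches 72% of the full-search normalized recall for 8%
of the correlations and 15% of the accesses. The broad search reaches 80%
of the recall and 72% of the precision for 13% of the correlations and 24%
of the accesses.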
4. Summary

The present chapter describes a series of experiments related to the
indexed sequential access method for managing a disk-resident, clustered
file. Their results are as follows.

a) For forward search strategies, storing the hierarchy by
successive levels gives the most economical searches under
a variety of conditions. This finding is likely to hold
for larger collections and search strategies with some
backtracking also.

b) Cluster searching can retrieve many relevant documents
with much less system effort than a full search. In the
test cases, a cluster search using 10-16 accesses achieves
about 70% of the performance (NR, NP) of a full search
requiring 65 accesses.

A number of other issues related to physical file organization are
discussed: handling of ISAM indexes and overflow space, space-time
tradeoffs in file storage, and evaluation measures for disk I/O.
Chapter VIII

Experiments with a Query Alteration Scheme Based on a Cluster Hierarchy
1. Introduction

Chapter III states that a clustered document file is economical only 
if it reduces search time or provides facilities not available in other 
organizations. It also outlines several alternate uses of a cluster 
hierarchy in an attempt to help justify its construction. Particular 
attention is given to the concept of associating a substitute (closely 
related term) with each base profile term . The result is a structure 
which combines the functions of a thesaurus for query expansion and a 
directory for file searches. The use of thesaurus classes (substitutes) 
in combination with clustering is new and holds two distinct advantages. 
First, since each node is equipped with its own set of substitutes, there 
is a unique opportunity for using term-term associations from a group of 
highly related documents (those beneath the node). Consequently, local, 
narrow term relationships are accommodated on lower hierarchy levels and 
broader, general relationships are handled on upper levels. Second,
substitutes can be used in several ways depending on which ones are 
applied and how they are matched. If term substitutes improve retrieval, 
then the utility of a clustered file is increased. 

This chapter seeks to establish the validity of an automatic query
alteration scheme using term substitutes. Consideration is limited to
the TIED matching option mentioned in Section III.8. That is, only the
substitutes of matching base terms may alter a query-profile correlation.
The recall experiments employ substitutes from parent nodes; the precision
experiments use substitutes from nodes on the current level. In all cases,
term substitutes are generated by the same algorithm, although certain
minor changes are allowed. Briefly, for any given profile term (base),
the algorithm identifies its substitute as another term in the same profile
which has the largest term-term correlation in the documents beneath
that node. Consequently, the derivation could involve computing large
similarity matrices. Fortunately, the magnitude of this task can be
reduced. Specifically, the upper level matrices can be formed from those
on lower levels. Furthermore, the previous experiments suggest that only
the most important profile terms need be considered, and the use of
shortened profiles, P*(δ = -1), in the experiments reduces the effort
further. These reductions are important since most work with thesaurus
construction and term-term relationships is greatly hampered by computing,
storing, and handling large matrices.

In the final analysis, the proposed alteration scheme does not
improve retrieval, at least for the options tested. However, it appears
useful as part of a search feedback scheme. The concept of augmenting each
profile with its own thesaurus (term substitutes) remains intriguing; in
view of success in similar experiments with unclustered collections,
further work is warranted.

2. Deriving Base-Substitute Pairs 

Base-substitute pairs are profile terms with maximum term-term
correlation among the documents in a node's crown. For a detailed
explanation of this concept, consider a hierarchy without overlap such as
that in Figure VIII-1a. For any node n with profile P, let T be the
term-document matrix representing the documents in its crown. As shown in
Figure VIII-1b, the element $T_{ij}$ is the weight of the j-th concept in
the i-th document and a row corresponds to a complete document. Suppose
an association matrix $A = T'T$ is formed, where the ' indicates matrix
transpose. Then the cosine similarity between the i-th and j-th terms with
respect to node n is

$$S_{ij} = \frac{A_{ij}}{\sqrt{A_{ii} A_{jj}}} \qquad \text{(VIII-1)}$$

Once the matrix S is obtained, the substitute for the base profile term
i is another profile term j such that

$$S_{ij} = \max_{k} \{\, S_{ik} \mid k \neq i,\ k = 1, 2, \ldots, v \,\} \qquad \text{(VIII-2)}$$



In practice, there are more terms in T than in P because of term deletion 
in profile construction. Consequently, the association matrix A can be 
limited to those terms in P. This saves a considerable number of
term-term correlations and does not alter the above computations.

The real problem is finding an economical algorithm for obtaining
the similarity matrix S for each node. The difficulty is that T is
available only by rows (document vectors) while direct calculation of S
requires its columns. Transposing T to make its columns accessible
requires a moderate amount of computation; this operation is equivalent to
inverting the set of document vectors to get term vectors. Another
algorithm for obtaining A is to accumulate its elements as documents are
input and to avoid forming T' altogether. The following steps would be
executed:

a) initialize A = 0;

b) read the next document $D = (d_1, d_2, \ldots, d_v)$;

c) set $A_{ij} = A_{ij} + d_i d_j$ for i, j = 1, 2, ..., v;

d) repeat steps b) and c) for all documents in the crown
of node n.

Figure VIII-1
a) Sample Cluster Hierarchy (Level 0 through Level 2)
b) Term-Document Matrix (rows are document vectors, columns are term
vectors; element (i, j) is the weight of term j in document i)
c) Association Matrix



For a small number of profile terms, either technique is acceptable. 

For a moderate number of terms, accumulation is preferable since each 
document is handled only once. However, if the association matrix cannot 
be made core resident, the inversion technique must be used regardless 
of other considerations. 
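
The accumulation steps a) through d), together with equations VIII-1 and
VIII-2, can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: documents are taken to be dense
lists of v term weights already restricted to the profile terms, and the
function names (`association_matrix`, `similarity_matrix`, `substitutes`)
are hypothetical, not part of the original system.

```python
import math

def association_matrix(documents, v):
    """Accumulate A = T'T one document at a time, without forming T'."""
    A = [[0.0] * v for _ in range(v)]                 # a) initialize A = 0
    for doc in documents:                             # b) read the next document
        nonzero = [(i, w) for i, w in enumerate(doc) if w != 0]
        for i, di in nonzero:                         # c) A_ij = A_ij + d_i * d_j
            for j, dj in nonzero:
                A[i][j] += di * dj
    return A                                          # d) all crown documents seen

def similarity_matrix(A):
    """Cosine term-term similarities, S_ij = A_ij / sqrt(A_ii * A_jj)."""
    v = len(A)
    S = [[0.0] * v for _ in range(v)]
    for i in range(v):
        for j in range(v):
            denom = math.sqrt(A[i][i] * A[j][j])
            S[i][j] = A[i][j] / denom if denom > 0 else 0.0
    return S

def substitutes(S):
    """For each base term i, pick the term j != i with the largest S_ij."""
    subs = {}
    for i, row in enumerate(S):
        others = [k for k in range(len(row)) if k != i]
        if others:
            subs[i] = max(others, key=lambda k: row[k])
    return subs
```

Each document is touched once, which is the property that makes
accumulation attractive when the matrix A fits in core.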

The similarity matrices for the lowest level nodes can be obtained 
economically by one of the above algorithms since only a few documents 
are involved in the computation. Matrices for upper level nodes could be 
derived in the same way, but with considerable duplication of effort. 

The fact is that the association matrices on level i are calculable
directly from those on level i + 1. To illustrate, consider node $n_1$ and
its sons $n_2$ and $n_3$ in Figure VIII-1a. Since there is no further need to
[...] document-term, association, and similarity matrices in conjunction
with node $n_k$. Using the assumption that the hierarchy contains no
overlap, $T_1$ can be viewed as partitioned so that

$$T_1 = \begin{bmatrix} T_2 \\ T_3 \end{bmatrix}, \qquad A_1 = T_1'T_1 = T_2'T_2 + T_3'T_3 = A_2 + A_3 \qquad \text{(VIII-3)}$$

If $A_2$ and $A_3$ are saved in magnetic tape files, for example, then $A_1$
can be obtained by summing corresponding records in each file.

In practice, a clustered collection contains overlap. This can be
dealt with also. In the example, let $U_2$ and $U_3$ denote the documents
unique to nodes $n_2$ and $n_3$ respectively, and let X denote their common
documents. Then

$$T_1 = \begin{bmatrix} U_2 \\ X \\ U_3 \end{bmatrix}$$

$$A_1 = T_1'T_1 = U_2'U_2 + X'X + U_3'U_3 = A_2 + A_3 - X'X \qquad \text{(VIII-4)}$$



Thus, the effects of overlap can be removed by subtracting a small
correction matrix made from the overlapping documents. It is not
necessary to actually compute X'X since corrections can be made in place as
described in the accumulation technique above. In the case of a document
having membership in several clusters, the correction must be applied
each time the overlap occurs. Overlap is easily detected if each tape
file (association matrix) is preceded by a list of documents used in its
preparation. Merging and checking these lists discloses overlap and
allows for corrections. Complete flowcharts for finding base-substitute
pairs are shown in Figures VIII-2 and VIII-3. Following the previous
example, it is assumed that all documents reside on disk and that extra
tape or other storage is available as needed.

Figure VIII-2
(S is the matrix of cosine term-term similarities)

Formation of Term Substitutes on Upper Hierarchy Levels

The discussion above suggests that corrections should be made for
the effects of cluster overlap via equation VIII-4. It can be argued that
a small amount of overlap could be neglected since it has little effect
on the final similarity matrix. However, it may not be desirable to
correct for overlap on the grounds that terms in these documents
characterize several topics (clusters) and thus should receive additional
emphasis when determining the substitute sets for upper level nodes.
Generally speaking, neglecting the correction factor -X'X in equation
VIII-4 increases similarities when both the i-th and j-th terms occur in
overlapping documents. Consequently, the desired similarity increases are
made and the computation of association matrices is simplified at the
same time.
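
The construction of a parent's association matrix from those of its sons,
with or without the overlap correction, might look as follows. This is a
minimal sketch, assuming each son's "file" is a pair (list of document
identifiers, association matrix) and that document vectors can be looked
up by identifier; all names are hypothetical. The correction is made in
place, one overlapping document at a time, rather than by forming X'X.

```python
def add_outer(A, doc, sign=1.0):
    """In place: A_ij = A_ij + sign * d_i * d_j for one document vector."""
    for i, di in enumerate(doc):
        if di:
            for j, dj in enumerate(doc):
                if dj:
                    A[i][j] += sign * di * dj

def association(docs, v):
    """Association matrix of a lowest-level node, by accumulation."""
    A = [[0.0] * v for _ in range(v)]
    for doc in docs:
        add_outer(A, doc)
    return A

def parent_association(child_files, vectors, v, correct_overlap=True):
    """Sum the sons' matrices; optionally remove each repeated document."""
    A = [[0.0] * v for _ in range(v)]
    seen = set()
    for doc_ids, A_child in child_files:
        for i in range(v):
            for j in range(v):
                A[i][j] += A_child[i][j]
        if correct_overlap:
            for d in doc_ids:
                if d in seen:                 # document already counted once
                    add_outer(A, vectors[d], sign=-1.0)
        seen.update(doc_ids)
    return A
```

With `correct_overlap=False` the sketch reproduces the simplified variant
in which overlapping documents are deliberately counted more than once.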

This section presents several methods for deriving base-substitute
pairs. It is a bit difficult to quantify the savings from using the
computation schemes suggested here. However, if a typical shortened
profile contains 20% of the terms represented in its term-document matrix
T, then its association matrix contains only 4% of the entries in the full
matrix T'T. The fact that upper level similarity matrices can be obtained
from those on lower levels clearly saves substantial sort or calculation
time regardless of how overlap is handled. In the following experiments,
substitutes are obtained for each term in the P*(δ = -1) profiles for
Hierarchy 1. In all cases, similarity matrices are computed by inverting
the documents in a node's crown and finding the cosine correlations
directly. Corrections are always made for cluster overlap. Several
complete sets of substitutes are generated by applying different
frequency and other restrictions to the participating terms. Each set
consists of 69 groups of base-substitute pairs: 55 for the nodes on level
2, 13 for the nodes on level 1, and one for a dummy node on level 0 as
illustrated in Figure VIII-1. Since there is no profile corresponding
to the dummy node, the terms used for its base-substitute pairs are the
most frequent keywords in the collection, up to 20% of the entire
vocabulary.

3. Term Substitutes as Precision and Recall Devices

Previous sections outline various ways of using term substitutes in
query searches. When employed as precision devices, substitutes of one node
help distinguish it from all others by altering correlations which involve
matches on both a base term and its substitute. Presumably it is
reasonable to increase the correlation since 1) the base and substitute
terms are highly related in the documents beneath the node and 2) both
terms are used in the request. The intent, then, is to improve precision by
giving more emphasis to combinations of matches on existing query terms
rather than by adding new or related terms as in a normal thesaurus
expansion. The specific algorithm for modifying a correlation contains the
following steps:

a) determine the set of base profile terms B which match
the query;

b) using the terms in B, compute an initial cosine correlation
$C_1$ between the query and profile;

c) determine the set $B' \subset B$ whose elements are base terms and
their substitutes provided that both are already elements
of B;

d) using the terms in B', compute an additional cosine
correlation $C_2$ between the query and profile; and

e) compute the final correlation value $C = C_1 + eC_2$ where
e is an experimentally determined emphasis factor.

Clearly, the case of e = 0 is the same as not using substitutes; e > 0
gives higher correlations to vector matches involving bases and their
substitutes; e < 0 does just the opposite. Another explanation and an
example of this algorithm is given in Section III.8.B. Using the
terminology developed there, substitutes are obtained from nodes on the
current hierarchy level rather than previous levels and they are TIED to
their bases since matches must occur on both a base and its substitute
before a correlation is altered.
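
Steps a) through e) can be sketched as follows. The sketch rests on two
assumptions that the text does not spell out: queries and profiles are
represented as dictionaries mapping terms to nonzero weights, and both
$C_1$ and $C_2$ are normalized by the full query and profile vector
lengths, so that only the numerator is restricted to B or B'. The
function and argument names are hypothetical.

```python
import math

def _length(vec):
    return math.sqrt(sum(w * w for w in vec.values()))

def _partial_cosine(query, profile, terms):
    """Cosine whose numerator is restricted to the given terms."""
    denom = _length(query) * _length(profile)
    if denom == 0:
        return 0.0
    return sum(query[t] * profile[t] for t in terms) / denom

def precision_correlation(query, profile, substitute, e):
    """substitute maps a base profile term to its substitute term."""
    B = set(query) & set(profile)                  # a) matching base terms
    c1 = _partial_cosine(query, profile, B)        # b) initial correlation
    B_prime = set()                                # c) bases and substitutes
    for base in B:                                 #    that are both in B
        sub = substitute.get(base)
        if sub in B:
            B_prime.update((base, sub))
    c2 = _partial_cosine(query, profile, B_prime)  # d) additional correlation
    return c1 + e * c2                             # e) final value
```

Under these assumptions, e = 1 counts every term of B' twice in the
numerator, which is one way to read the later remark that the weight of
both terms is effectively doubled.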

The precision experiments involve three sets of substitutes for the
base terms on level 1 of Hierarchy 1 (P*(δ = -1) profiles). Each set
is generated by the procedure outlined in the previous section, but
subjected to one of the following restrictions:

a) no restrictions;

b) only base profile terms of medium frequency are permitted
to have substitutes; or

c) the correlation between a base and substitute must lie
in the interval (0.45, 0.75).
The intent of these restrictions is to alleviate effects from terms of
little importance and those having chance relationships with other terms.
Figure VIII-4 contains the PF-RC evaluation curves for searches with and
without the use of substitutes as precision devices. Unfortunately for
the options tested, substitutes do not help discriminate among nodes.
Neither frequency nor correlation restrictions nor changes in the emphasis
factor (e) have much ability in raising the overall performance to that
of a search mode without substitutes. It is not the case that all
requests do poorly; a substantial number do slightly better and a few see
spectacular improvement. However, the overall trend is on the downward
side. For the case of e = 1 correlations are probably dominated by
matching bases and substitutes since the weight of both terms is 
effectively doubled (see Chapter V). For e * the correlation probably 
depends too much on random term matches. Although e = 0 is certainly not
an optimal value, these tests indicate there is little to be gained from 
using substitutes as precision devices. 

Figure VIII-4
(P*(δ = -1) profiles, Hierarchy 1, level 1)

More frequently, a thesaurus or term substitutes are employed as recall
devices, that is, as a source of new keywords for broadening the scope of
a request. The additional terms increase the opportunities for
query-profile matches and therefore result in higher recall searches.
Generally a thesaurus contains term relations appropriate to the entire
collection (1). However, with clustered documents, substitutes may be
associated with each node (small set of documents) and requests can be
altered in a selective, careful manner. In the next experiment, the only
substitutes influencing a profile correlation are those in the profile's
parent nodes; and this set is limited further to those substitutes whose
bases have already matched the request. Obviously, nodes on level 1 have no
parent; in this case, the substitutes in the hierarchy's dummy node are
used as suggested in Section VIII.2. For a precise illustration of the
correlation procedure, consider a search which expands only one node on
level i. When the request is matched with this node's profile, a set of
matching base terms B is obtained; let S be the substitutes for these terms
which do not already appear in the query. Each element of S is assigned a
weight w and temporarily appended to the request. Using the broadened
query, the following matching procedure is applied to each subordinate
profile of the expanded node (those on level i + 1):

a) compute an initial profile-query correlation $C_1$ without
using the substitutes in S;

b) determine a set of substitutes $S' \subset S$ whose bases took
part in the correlation $C_1$;

c) compute a secondary correlation value $C_2(w)$ using only
the substitutes in S'; and

d) compute the final correlation $C = C_1 + C_2(w)$.

C is clearly a function of w since the substitutes have pre-assigned
weights. Clearly w = 0 is the case of not altering correlation values
and as w increases so does the importance of the terms used to broaden
the request. The above procedure is applicable to searches which expand
more than one node; it is a bookkeeping matter to determine which
substitutes are to be used. The intent of this seemingly complicated
procedure is the careful addition of new request terms from the parent
nodes. Furthermore the new terms (substitutes) have an opportunity to
affect correlations only when there is a match on the corresponding base
(TIED option). An example of this scheme is given in Section III.8.B. One
additional remark is required for a complete description. The substitutes
added to a request have a temporary existence. That is to say they are
replaced by substitutes from expanded nodes on lower levels. Thus the
request is altered more and more selectively as the search proceeds and
the query vector does not become unnecessarily long.
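
The recall procedure for a single expanded node can be sketched in the
same style. Several details here are assumptions rather than statements
from the text: vectors are dictionaries of term weights, $C_2(w)$ is
normalized by the lengths of the original (unbroadened) query and of the
subordinate profile, and a substitute shared by two bases is attached to
only one of them. All names are hypothetical.

```python
import math

def _length(vec):
    return math.sqrt(sum(x * x for x in vec.values()))

def recall_correlation(query, parent_profile, parent_subs, child_profile, w):
    """parent_subs maps a base term of the expanded node to its substitute."""
    # Matching base terms B at the expanded node, and the substitutes S
    # (kept with their bases) that do not already appear in the query.
    B = set(query) & set(parent_profile)
    S = {parent_subs[b]: b for b in B
         if b in parent_subs and parent_subs[b] not in query}

    denom = _length(query) * _length(child_profile)
    if denom == 0:
        return 0.0

    # a) initial correlation without the substitutes in S
    matched = set(query) & set(child_profile)
    c1 = sum(query[t] * child_profile[t] for t in matched) / denom

    # b) S': substitutes whose bases took part in C1 (TIED option)
    s_prime = [s for s, base in S.items() if base in matched]

    # c) secondary correlation using only S', each with weight w
    c2 = sum(w * child_profile.get(s, 0.0) for s in s_prime) / denom

    return c1 + c2                                  # d) final correlation
```

Setting w = 0 leaves the correlation unchanged, and a subordinate profile
that does not match the base term receives no contribution from its
substitute.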

The paragraphs above describe how term substitutes are used as recall
devices in these experiments. As before, Hierarchy 1 and P*(δ = -1)
profiles provide the test collection. Two different sets of substitutes
are employed, the first of which has no restrictions. In the second set,
only medium and high frequency profile terms are permitted to have
substitutes and then only if the correlation between each base and
substitute exceeds 0.25. Figures VIII-5 and VIII-6 show the PF-RC
evaluation curves for searches with and without the use of substitutes as
recall devices.

In each case, less favorable performance is obtained when terms are
added to requests. A smaller performance loss is noted when frequency
restrictions are applied, mostly because there are fewer cases in which
correlations are altered. Neither changes in the restrictions on the
substitute set nor in the weight of terms added to requests increase the
recall level as desired. Again, it is not the case that all requests do
poorly; some show great improvement. Consequently, these techniques
could be used as an automatic feedback procedure, particularly when no
relevant documents are retrieved in an initial search. For example, the
use of substitutes improves the performance of about half the requests
for which no relevant documents are contained in the 2 top-ranked clusters
in a normal search.

One problem in both the precision and recall experiments is the
modest number of cases in which substitute terms actually influence
correlations. The number of cases could be increased, in part, by
switching to the UNTIED matching option, that is, giving substitutes the
same opportunity for matches as any other term. Another problem is that in
many instances, the same base-substitute pair occurs in many nodes and
thus gives little discrimination. A partial solution to this may lie in
placing different restrictions on the terms allowed to have substitutes
and the strength of associations among terms. However, it may be that
no substitutes can adequately discriminate among profiles because they
represent a large number of documents. This might explain why more
success occurs when term-term relations are used to distinguish individual
documents from each other as in the experiments by Jones (1, 2). Of
course, these experiments are not intended to address this broader
question.

4. Summary 

This chapter describes a query alteration procedure based on sets of
term substitutes associated with each node of a cluster hierarchy. The
substitute for each base profile term is another profile term which is
strongly related to the base in the documents beneath the node under
consideration. Depending on how they are applied, base-substitute pairs
function as either precision or recall devices. Unfortunately for the
options tested here, the use of substitutes results in performance losses
rather than gains. Consequently, substitutes should not be included in
profiles in their present form. They might be used profitably as part of
a feedback procedure or presented to a user browsing through the
collection, however. In spite of the results, the basic idea of combining a
thesaurus and a profile is appealing since both are tailored to the
contents of a specific group of documents.

Two significant contributions of this chapter relate to the
computation of term-term similarity matrices. It is shown that a complete
matrix can be obtained from a set of smaller matrices representing only
part of the collection. Furthermore, under certain conditions, it is
unnecessary to invert the document set to obtain term vectors required in
the calculation. Both of these techniques provide reductions in the
amount of computer time for generating similarity matrices.

Chapter IX

Comparison of Inverted and Clustered Document Files

1. Introduction

The most widely used file organization in document retrieval systems 
is the inverted organization. Chapter II discusses the principles of this 
technique and the space-time tradeoffs involved. The basic idea is to 
construct a data base which lists a document under entries for each of
its index terms. The file is inverted in that it is maintained in term 
order rather than document order. Actual implementations generally use 
a combined file approach. That is, documents are stored consecutively 
on disk, but in such a way as to be individually accessible, for example, 
through their accession number. In addition, an inverted directory is 
constructed with one entry per vocabulary term; each entry is a list of 
accession numbers of documents containing that term. An alternate scheme 
might place disk addresses in the directory rather than accession numbers. 
In either case, the search program computes correlations with all docu- 
ments on directory lists corresponding to query terms. In some instances, 
including "within document weights" in directory entries allows
correlations to be accumulated during the directory scan; therefore, only the
highest ranking document citations are ever taken from storage. This is 
the case for the cosine function; consequently the directory scan is of 
primary importance in the comparisons made here. 

The purpose of this chapter is:

a) to compare the storage requirements for inverted and
clustered files;

b) to examine the search speed of an inverted directory as a
function of query length and collection size; and

c) to compare the speed and effectiveness of inverted and
cluster searches.

The test procedure is similar to that in Chapter VII. Briefly, disk 
storage and retrieval is simulated for a given collection of documents 
and requests while monitoring simulated I/O activity. Data is tabulated
by query length or by hierarchy level in order to make the necessary 
comparisons. The tests result in the following conclusions. 

a) The inverted organization requires twice as much storage
space as a clustered file in order to provide equivalent
retrieval services. However, if relevance feedback or
document space modification are not included in the
system, both file organizations require about the same
amount of storage.

b) Search time (disk accesses) in an inverted file increases
with query length (linearly), collection size, and the
number of documents retrieved. The directory scan
requires 1-2 accesses per term, depending on the collection
size. Search time in a clustered file is basically a
function of the number of expanded clusters (i.e., search
strategy), and somewhat independent of query length and
number of retrieved items.

c) For a specific number of disk accesses, the inverted file
search retrieves a fixed number of documents and achieves
high precision at a specific recall level. For the same
number of accesses, the cluster search provides the user
with many or few documents. Generally the precision is
less, but the recall level may be higher or lower
depending on the number of retrieved documents.

Naturally, these conclusions are subject to the assumptions, conditions,
and search strategies in the various tests. Neither scheme is superior
on all points; each one has its own strengths. The least that can be
said is that the clustered organization uses no more storage space and
provides more flexible searches. That is, searches can be quick and
inexpensive, thorough and costly, or aimed at high or low recall.
However, they are less precise than inverted searches in most instances.

2. The Inverted Directory: Storage and Search

In the following experiments the inverted organization is a
combination of two files. The first is a consecutive disk file of all
documents including citation data, index terms, and weights. The citation
is retrieved for user printouts while the terms and weights are required
only if relevance feedback or space modification is included in the
system capabilities. (See Chapter II for a discussion of this point.)
The second part is the inverted directory containing one entry per
vocabulary term. Each entry is a list of accession numbers of documents
containing a given term and the "normalized term weights" within those
documents. For example, a document D with term i of weight $d_i$ adds the
following data to the directory list of that term:

Term i: (accession number of document D, normalized weight of term i
in D, $d_i / |D|$)

Given a query vector $Q = (q_1, q_2, \ldots, q_v)$ where $q_i$ is the weight
of term i, document correlations are accumulated one term at a time in the
following manner:

$$\cos(Q, D) = \sum_{i} \frac{q_i}{|Q|} \cdot \frac{d_i}{|D|} \qquad \text{(IX-1)}$$

Here the first factor comes from the query vector and the second from the
inverted directory entry.
Since each directory list contains data pertaining to many documents, 
sufficient core storage must be available to hold the partial sums related 
to each document. After accessing all appropriate lists and accumulating 
sums as illustrated, correlations are sorted and citations retrieved and 
printed in decreasing order of similarity. A complete example of this
process is given in Figure II-3.
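
The directory scan described above can be illustrated with a small
in-memory sketch. It assumes documents and queries are dictionaries of
term weights with at least one nonzero weight per document, and it keeps
the whole directory in core, which the experiments in this chapter do
not; the names `build_directory` and `directory_scan` are hypothetical.

```python
import math
from collections import defaultdict

def build_directory(documents):
    """documents: accession number -> {term: weight}.
    Each directory list holds (accession number, normalized weight)."""
    directory = defaultdict(list)
    for accession, doc in documents.items():
        length = math.sqrt(sum(w * w for w in doc.values()))
        for term, w in doc.items():
            directory[term].append((accession, w / length))
    return directory

def directory_scan(query, directory):
    """Accumulate cos(Q, D) one query term at a time, then rank."""
    q_length = math.sqrt(sum(w * w for w in query.values()))
    partial = defaultdict(float)          # one partial sum per document
    for term, q in query.items():
        for accession, norm_weight in directory.get(term, ()):
            partial[accession] += (q / q_length) * norm_weight
    return sorted(partial.items(), key=lambda item: item[1], reverse=True)
```

Only documents that share at least one term with the query ever receive a
partial sum, so documents with zero correlation are never touched.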

The precision-recall data for a search using an inverted file is the
same as that for a full search since correlations are computed for all
documents D such that cos(Q, D) > 0. The number of correlations and disk
accesses is much less than for a full search. In particular, the number
of accesses is proportional to the query length (directory scan) and the
number of documents the user wishes to view. The remainder of this
section considers only the I/O activity in the directory scan, however.

In order to determine how directory scan time varies with query
length, a simulation is made similar to those in Chapter VII. That is, a
disk storage map is constructed containing the track and cylinder locations
of inverted directory records as though they were actually stored on
disk. During the simulated searches, the number of disk accesses is
calculated from the changes in track and cylinder locations as directory
lists are obtained. Averaging this data and plotting it versus query
length shows the desired relationship between I/O activity and the number
of query terms. The storage map is made assuming either (1) indexed
sequential access (ISAM) to records or (2) direct access to records, for
example, through a scatter storage scheme. In either case, each track of
the simulated disk is treated as one physical record and the inverted lists
(records) are packed onto it using the storage algorithm in Section III.6.
This algorithm balances the amount of wasted disk space and the splitting
of records on an unnecessarily large number of tracks (thereby requiring
extra accesses during retrieval). Where necessary, space is traded for
time only if the waste is less than a threshold amount (θ bytes per track).
The parameters for the storage algorithm are shown in Table IX-1 and are
the same as those used in tests with clustered files.

Parameter                                      Value

Number of data tracks/cylinder                 18
Track size                                     7294 bytes
Threshold for wasted space (θ bytes/track)     720
ISAM index size (track or cylinder)            300 bytes
Direct access index size                       0

Parameters for the Management of the Inverted Document File

Table IX-1

The size of each inverted directory list is calculated from the
number of uses of the corresponding term. It is assumed that 4 bytes of
storage are sufficient to hold a document accession number and normalized
weight and that each list is preceded by 20 bytes of header information.
Thus, since term number 4155 appears in 138 documents, its directory list
requires 20 + 4 × 138 = 572 bytes. When the storage map is actually made,
directory records are stored in alphabetical order by keyword. In the
case of ISAM access, cylinder and track indices are interspersed at
appropriate locations. Table IX-2 shows storage statistics for the
inverted directories of two collections. The smaller collection
represents the actual Cranfield documents. The larger collection
approximates what the Cranfield collection might look like after a tenfold
size increase. Assuming each term continues to occur with its present
relative frequency, each list is simply 10 times larger than before. This
is somewhat tenuous when applied to individual terms, but might be an
adequate overall approximation. The new terms that would enter the
vocabulary are not considered here. The table data shows only a 1%-2%
waste of disk space for either collection even though up to 10% is allowed
(720 bytes/track). Because the inverted lists are longer in the larger
file, the storage algorithm splits a greater percentage of them over an
unnecessarily large number of tracks. This increases retrieval time
significantly since the lists which are split usually correspond to
frequent document and request terms. Overall, the inverted directory for
the 1400 Cranfield documents requires 57 disk tracks. Using the complete
combined file takes another 62 tracks for document vectors (citation data,
terms, and weights). If relevance feedback and space modification are not
desired, the consecutive file could be limited to citation data only. For
the Cranfield collection this requires about 14 disk tracks so the total
storage space under the inverted organization is either 119 or 71 disk
tracks (a 92% or a 15% overhead). By comparison, the entire clustered file
in Chapter VII uses 69 tracks (11% overhead) and provides for feedback,
space modification, and more flexible searches.

Property                                           1400        14000
                                                   Documents   Documents

Number of inverted directory lists (records)       5030        5030
Storage for record header (bytes)                  20          20
Storage for accession number and weight (bytes)    4           4
Average total record size (bytes)                  80          620
Total directory size (tracks)                      57          432
Percent of disk space wasted                       1.8%        1.3%
Percent of records requiring an extra access
(100% = 5030)                                      0.0%        3.3%

Results of Storing the Inverted Directory

Table IX-2
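
The storage arithmetic in this section can be checked with a short
calculation. The sketch below assumes that the quoted overhead figures
are measured against the 62 tracks needed for the document vectors alone;
that baseline is inferred from the numbers and is not stated explicitly
in the text. The function names are hypothetical.

```python
HEADER_BYTES = 20   # header preceding each directory list
ENTRY_BYTES = 4     # accession number plus normalized weight

def list_size(n_documents):
    """Bytes needed for the directory list of a term used in n documents."""
    return HEADER_BYTES + ENTRY_BYTES * n_documents

def overhead_percent(total_tracks, base_tracks=62):
    """Extra storage relative to the document vectors alone (assumed base)."""
    return round(100 * (total_tracks - base_tracks) / base_tracks)

print(list_size(138))            # term 4155, used in 138 documents
print(overhead_percent(57 + 62)) # inverted directory plus full documents
print(overhead_percent(57 + 14)) # inverted directory plus citations only
print(overhead_percent(69))      # clustered file of Chapter VII
```

The same formula accounts for the average record sizes in Table IX-2: an
80-byte list corresponds to 15 entries, and ten times as many entries
gives `list_size(150)`, or 620 bytes.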

To collect data on the I/O activity during the scan of the inverted
directory, simulated searches are made using the Cranfield query set. For
each query, the disk arm is "positioned" just outside the directory and
track and cylinder changes are monitored as each directory list is
accessed. For the ISAM case, the first access picks up the cylinder and
first track indices. Thereafter, moving to a new cylinder includes an
additional access to obtain the necessary track index. For the case of
direct access, it is assumed that either no indices are required or that
they are permanently core resident. This may or may not reflect a
situation that can actually be implemented. The intent is simply to
discover how much overhead is involved in obtaining the ISAM indexes. In
both cases one optimization technique is employed. Because the query
vector identifies all the directory lists to be accessed, it is assumed
that the lists are obtained in a single sweep across the disk surface
rather than by jumping forward and backward. During the simulated
searches, the evaluation data collected is:

a) the average number of accesses for queries of a given
length;

b) the average number of track and cylinder changes for
queries of each length; and

c) the average stepsize per change (number of tracks or
cylinders traversed per change).
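
The bookkeeping described above can be sketched as follows. This is a minimal illustration and not the simulator used in this work: the function name, the representation of the directory lists as (cylinder, track) pairs, the rule that every distinct track costs one access, and the decision not to count the initial arm positioning as a change are all assumptions made for the example.

```python
def simulate_directory_scan(list_tracks, use_isam):
    """Count accesses and arm movements for one query's directory scan.

    list_tracks: (cylinder, track) pairs, one for every directory track
    that must be read to obtain the lists named by the query.
    """
    sweep = sorted(set(list_tracks))  # single sweep in ascending disk order
    accesses = 0
    cylinder_changes = 0
    track_changes = 0
    cylinders_traversed = 0
    previous = None
    for cylinder, track in sweep:
        if previous is None:
            if use_isam:
                accesses += 1  # cylinder index and first track index
        elif cylinder != previous[0]:
            cylinder_changes += 1
            cylinders_traversed += cylinder - previous[0]
            if use_isam:
                accesses += 1  # track index of the new cylinder
        else:
            track_changes += 1
        accesses += 1  # read the directory track itself
        previous = (cylinder, track)
    return {
        "accesses": accesses,
        "cylinder_changes": cylinder_changes,
        "track_changes": track_changes,
        "cylinders_traversed": cylinders_traversed,
    }


# Hypothetical query whose lists lie on four distinct tracks in three cylinders.
example_tracks = [(0, 1), (0, 3), (2, 0), (2, 0), (5, 4)]
direct = simulate_directory_scan(example_tracks, use_isam=False)
isam = simulate_directory_scan(example_tracks, use_isam=True)
print(direct["accesses"], isam["accesses"])
```

Averaging these counts over all queries of a given length would give data of the kind listed in a) through c); the difference between the two access counts is the ISAM index overhead for that query.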

Figure IX-1 shows how the average number of disk accesses in an inverted
directory scan varies with query length for both the large and small
collections. The tests use all 225 requests; however, only a few
requests have more than 13 terms, so the points at the right hand end
of the curves are somewhat less reliable than the others. In all cases
there is a linear relationship between query length and the average number
of directory accesses. For 1400 documents, each query term requires
0.9-1.0 disk accesses (average); overall, a penalty of 2.7 accesses is
paid for using ISAM. The amount of I/O increases considerably for the
larger collection, roughly doubling when 10 times as many documents are
added. In this case each term results in 1.6-2.4 accesses and the ISAM
overhead is 4-10 accesses. Other collections probably exhibit a similar
linear relationship as well as I/O requirements which increase as those
observed here. It must be remembered, however, that these figures apply
only to the directory scan and not the entire retrieval process. In actual
searches, additional accesses are required to obtain and print citation
data.
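
The linear relationship can be written as a one-line estimate. The sketch below is only an illustration: treating the ISAM penalty as a constant added to a per-term cost is a simplifying assumption, and the function and parameter names are not taken from this work.

```python
def directory_accesses(query_terms, accesses_per_term, isam_overhead=0.0):
    """Linear estimate of the directory-scan accesses for one query."""
    return accesses_per_term * query_terms + isam_overhead


# 1400-document collection, 9-term query, ISAM access, using the
# per-term range and the ISAM penalty quoted in the text.
low = directory_accesses(9, 0.9, isam_overhead=2.7)
high = directory_accesses(9, 1.0, isam_overhead=2.7)
print(f"estimated directory accesses: {low:.1f} to {high:.1f}")
```

For a 9-term query the estimate is roughly 10.8 to 11.7 accesses, which brackets the 11.3 accesses quoted for a typical query in Section 3.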

The fact that scan time in an inverted directory increases with query
length has good and bad aspects. On one hand, it supplies a convenient and
reasonably equitable scheme for recovering search costs, namely by charging
a fixed dollar amount for each query term. On the other hand, it is
difficult to obtain an inexpensive high-precision search, since more
accurate searches generally require a moderate number of query terms and
thereby incur greater costs. This is especially unfortunate when relevance
feedback is used since this process expands a request considerably.
Consequently, feedback searches could become quite costly. From the user's
standpoint, it is more satisfactory to separate query formulation and
search conditions. For example, a user's first task should be to prepare a
complete, accurate statement of his information needs. Only then should he
consider the amount of desired output, dollar cost, and other constraints
on searching. Unfortunately, this type of separation is difficult to
provide with an inverted file since the query formulation controls the
search strategy to a large degree.

    □   ISAM     14,000               23.3
    △   Direct   14,000               16.6
    O   ISAM     1,400 (Cranfield)    10.9
    O   Direct   1,400 (Cranfield)     8.2

Inverted Directory I/O (Disk Accesses) as a
Function of Query Length

Figure IX-1

3. Comparison of Inverted and Clustered Document Files

This section compares inverted and clustered document files
with respect to search speed and quality of retrieval. The previous section
considers storage requirements and shows that the inverted scheme needs
about twice the space of the clustered scheme if a combined file is used.
Regarding quality of retrieval, the precision-recall values from an
inverted file search are the same as those for a full search. In a clustered
organization, precision-recall data vary with the search strategy;
comparisons here are based on the narrow and broad searches of Hierarchy 1
(P*(δ = -1) profiles) as described in Chapters IV and V. Figure IX-2
shows P-R plots for these searches; following previous practice, the points
depict document level averages computed at cutoffs of 5, 10, 15, 20, 30, 50,
and 75. It is seen that at every point, the inverted curve lies significantly
above the curve for either cluster search.

What remains is to determine the I/O delay associated with each point
of the curve and to evaluate the combined sets of data. To obtain data on
I/O activity, searches are simulated for the clustered and inverted files
as described in Chapter VII and Section IX.2, respectively. It must be
emphasized that even though storage and retrieval are simulated, the
experimental parameters are based on properties of the actual Cranfield
collections, SMART search system, and IBM 2314 Disk Storage Facility. It is
believed the results accurately predict actual searches made under the
specified conditions.

    Symbol   Search Description

    o        Inverted File (Same as Full Search)
    □        Clustered File - Broad Search
    △        Clustered File - Narrow Search

Comparison of Precision-Recall Data From
Inverted and Clustered File Searches

Figure IX-2

Since a comparison among file organizations is difficult to make, it
is helpful to state the underlying assumptions.

General Assumptions

a) ISAM access to all data items. Both organizations would
benefit from direct access, i.e. use of absolute disk
addresses, but it is unrealistic to assume use of direct
access in operational systems.

b) Freshly constructed files. This implies all directory
lists and clustered items are physically contiguous in
storage rather than having "tacked on" elements due to
updating.

The following discussion describes additional considerations for the
individual organizations. The I/O data collected in the simulations are
summarized in Table IX-3 along with the P-R data shown previously. It will
be helpful to refer to this summary as the discussion proceeds.

First, consider the inverted search. The number of disk accesses in
the directory scan depends on the query length; a typical query contains 9
terms and therefore requires 11.3 accesses (see Figure IX-1). The I/O for
retrieving citations depends on the organization of the consecutive file
and on the number of citations printed. Recall that the consecutive file in
the inverted organization consists of complete document vectors, whereas
only the citations are required at the end of a search. Consequently the
file might actually consist of two distinct sub-files, one for citations
and one for terms and weights. The higher density of citations per track
results in fewer disk accesses when the subfile scheme is used. Table IX-3
includes data for both approaches. Another problem in collecting this
data is caused by the fact that the identifiers of all retrieved,
non-relevant documents are unknown. To circumvent this situation all
documents are assumed to have an equal probability of being retrieved and
data is based on random selections from the file. Even under these
conditions, it is quite probable that the simulated I/O is close to its
true value. As an example from the table, consider a typical Cranfield
request having 9 terms and 7 relevant documents. A search retrieving 10
documents (2 relevant, 8 non-relevant) would require a total of 19 or 24
accesses depending on whether a citation subfile is used. In either case,
11 accesses are spent in the directory scan and the remainder in obtaining
document citations (here, arbitrarily selected). A summary of the
additional assumptions for the inverted file search includes:

a) 9 terms per request (the average for the Cranfield queries);

b) sufficient core storage to hold all partial correlations
during the directory scan;

c) equi-probable retrieval of all documents.
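
The equi-probable retrieval assumption lends itself to a simple expected-value estimate of the citation I/O. The sketch below is not the simulation used in this work. It is a closed-form approximation which assumes that items are spread evenly over the tracks, that a track is read at most once per search, and that ISAM index accesses can be ignored; the function name is illustrative. For these reasons it is not expected to reproduce Table IX-3 exactly.

```python
from math import comb


def expected_tracks_touched(num_items, num_tracks, num_retrieved):
    """Expected number of distinct tracks holding a retrieved item.

    Assumes every subset of `num_retrieved` items is equally likely
    (selection without replacement) and an even spread of items over
    tracks; integer division makes this approximate when the item
    count is not a multiple of the track count.
    """
    items_per_track = num_items // num_tracks
    elsewhere = num_items - items_per_track
    # Probability that one particular track supplies no retrieved item.
    p_untouched = comb(elsewhere, num_retrieved) / comb(num_items, num_retrieved)
    return num_tracks * (1.0 - p_untouched)


# Cranfield figures from the text: 1400 documents stored on 14 tracks
# (citation subfile) or on 62 tracks (complete document vectors).
for cutoff in (5, 10, 30, 75):
    with_subfile = expected_tracks_touched(1400, 14, cutoff)
    without_subfile = expected_tracks_touched(1400, 62, cutoff)
    print(f"cutoff {cutoff:2d}: {with_subfile:4.1f} vs {without_subfile:4.1f} tracks")
```

With a citation subfile the estimate can never exceed 14 tracks, whereas without one it keeps growing with the cutoff, which is the qualitative behavior the subfile scheme is meant to exploit.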

Inverted File Search

Retrieval                          Directory     Total Accesses/Query
Cutoff       Recall    Precision   Accesses      2-Subfiles   No Subfiles

    5         .216       .266        11.3            16           19
   10         .318       .218        11.3            19           24
   15         .377       .186        11.3            21           28
   20         .429       .172        11.3            22           31
   30         .493       .148        11.3            24           38
   50         .563       .126        11.3            25           48
   75         .634       .144        11.3            26           57

Clustered File - Narrow Search

Retrieval                          Correlation   Total Accesses/Query
Cutoff       Recall    Precision   Accesses      No-rescan    Rescan

    5         .156       .190         9.7            9.7          16
   10         .208       .130         9.7            9.7          16
   15         .239       .102         9.7            9.7          16
   20         .261       .083         9.7            9.7          16
   30         .284       .063         9.7            9.7          16
   50         .307       .042         9.7            9.7          16
   75         .316       .029         9.7            9.7          16

Clustered File - Broad Search

Retrieval                          Correlation   Total Accesses/Query
Cutoff       Recall    Precision   Accesses      No-rescan    Rescan

    5         .185       .216        15.4           15.4          27
   10         .245       .151        15.4           15.4          27
   15         .286       .121        15.4           15.4          27
   20         .311       .101        15.4           15.4          27
   30         .349       .077        15.4           15.4          27
   50         .406       .055        15.4           15.4          27
   75         .429       .039        15.4           15.4          27

Comparison of I/O Requirements in Inverted and Clustered File Searches

Table IX-3

The cluster search assumes the same file and search parameters as in
Chapter VII, namely Hierarchy 1, P*(δ = -1) profiles stored by levels, and
the narrow and broad strategies used throughout the research. As shown in
the table, the narrow search requires an average of 9.7 accesses per query
for correlations while the broad search uses 15.4 accesses per query. In
many cases, enough core storage is available to contain the citations of
all documents to be retrieved, thereby avoiding a rescan of level-3 items.
For example, if the user requests 30 documents, the search program might
always maintain in core the citations of the 30 documents with the highest
correlations. As the search progresses, higher scoring documents are added
to this "active" list and others are deleted. If memory space is limited,
just the highest correlations need be kept along with the corresponding
document accession numbers. At the end of the correlation phase, citations
are obtained from disk. In most cases, this rescan requires a considerable
number of additional disk accesses (6.3 or 11.5 accesses, depending on the
search strategy). For example, suppose 5 clusters are expanded, resulting
in 150 document correlations. If 30 documents are returned to the user,
they will undoubtedly come from all 5 clusters. Hence fetching their
citations involves about the same amount of I/O as the initial scan of
level-3. Consequently an additional economy in a clustered file is realized
when a reasonable amount of core storage is available. Data for both
cases, no rescan and rescan, is shown in Table IX-3. It would appear that
a similar procedure would apply to the inverted search. This is not the
case, however, since correlations are accumulated and none can be discarded
until the very end. Saving all citations would require a prohibitive
amount of core storage.
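
The "active" list described above amounts to keeping the best few items seen so far while the clusters are scanned. A minimal sketch follows, using a bounded min-heap; the function name and the data layout are assumptions made for illustration and do not describe the SMART implementation. It shows the memory-limited variant, in which only correlations and accession numbers are held in core.

```python
import heapq


def best_documents(scored_documents, wanted):
    """Return the `wanted` best (correlation, accession_number) pairs.

    scored_documents yields (accession_number, correlation) pairs in the
    order in which the expanded clusters are scanned.
    """
    if wanted <= 0:
        return []
    active = []  # min-heap, so the weakest retained entry is active[0]
    for accession_number, correlation in scored_documents:
        entry = (correlation, accession_number)
        if len(active) < wanted:
            heapq.heappush(active, entry)
        elif entry > active[0]:
            heapq.heapreplace(active, entry)  # delete the weakest entry
    return sorted(active, reverse=True)


# Hypothetical correlations produced while expanding clusters.
stream = [(101, 0.20), (102, 0.90), (103, 0.50), (104, 0.70), (105, 0.10)]
top_two = best_documents(stream, wanted=2)
print(top_two)
```

If core storage permits, the citation can be carried as a third element of each entry, in which case no rescan of level-3 is needed once the correlation phase ends.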

Contrasting the I/O requirements reveals that the inverted search
obtains its superior P-R performance with significantly more disk accesses
than a cluster search. Consequently, the response time to an on-line
user is expected to be greater. However, the clustered file provides
either high recall or low recall searches for approximately the same number
of accesses. The inverted file gives a single type of search, but at
higher precision. Of note is the fact that the scan of the inverted
directory or profile hierarchy takes approximately the same effort,
depending on the search strategy. The essential difference in the schemes is
that at some point, the inverted search makes random accesses into the
data base for individual items, for example, to obtain citations. The
cluster search also makes random accesses, but only for groups of documents.
Its economy results from having concentrated, in a few locations, all
documents having a high probability of being relevant. Furthermore this
economy is likely to remain or even increase in larger collections. For
example, a cluster hierarchy need not grow in direct proportion to
collection size, but an inverted directory must increase proportionately in
order to maintain updated term entry lists. In addition, a larger
collection implies relevant documents are spread over more disk space in an
inverted file since documents have more or less arbitrary locations.
Clusters, however, retain their high density of relevant documents to a
large extent. These are at least two reasons why a clustered file should be
superior on a larger collection also.

However, it cannot be denied that an inverted file gives higher
precision searches whereas a clustered file is more economical of storage
space and provides more flexible searches. The ideal situation is to
combine both schemes, i.e. provide an inverted directory to clusters of
documents. This might be feasible if the directory size could be reduced to
10% of collection size, for example. The problem, naturally, is to make
an accurate differentiation among clusters on so little information. This
is exactly the problem considered in Chapter V where profiles are found
to provide an adequate solution.

4. Summary

This chapter compares the inverted and clustered file organizations
for the Cranfield collection on the basis of storage requirements, search
speed, and quality of retrieved output and attempts to generalize the
findings to larger document collections. As depicted here, the inverted
organization consists of a directory and consecutive files with a storage
overhead of 15% or 92%. The latter figure applies if relevance feedback
or space modification is included in the system. The clustered file incurs
an 11% storage overhead.

The search time (cost) and quality of retrieved output are harder to
compare because they are interdependent. For inverted files, search time
is a function of query length and the number of retrieved documents. For
a given query, the directory scan takes a fixed number of disk accesses
(about 1 access per term for the Cranfield collection). Thereafter, the
search cost and precision-recall values are determined by the number of
documents retrieved. (See Tables IX-2 and IX-3.) Presumably high precision
or feedback searches are expensive since they involve requests having
many terms and thus spend a lot of time in the directory scan. High
recall searches are costly because they generally require accessing a large
number of documents from arbitrary disk locations. The cluster search uses
a number of disk accesses proportional to the number of expanded clusters.
Search time is not dependent on the length or complexity of a request, but
on the broadness or narrowness of the search strategy. Profile correlations
are an overhead cost and serve to select several general areas of the disk
from which to begin retrieving documents. Once this phase is completed, a
search at any recall level may be obtained, for about the same cost,
simply by retrieving additional documents. The overall precision is less
than in the inverted search, however.

As noted earlier, the economies of the clustered file organization
should be present and perhaps more apparent in larger files. There is
every reason to believe that the profile hierarchy grows more slowly than
the inverted directory, since each new document does not necessarily
increase the size of any profile. In addition, with limited re-clustering,
it should be possible to maintain reasonable groupings of relevant
documents and, therefore, quick, successful searches.

Chapter X

Summary, Conclusions, Discussion, and Suggestions for Future Work

There are several ways to summarize this research and its importance
to information retrieval. A simple listing of chapter contents is a good
starting point.

Chapter   Contents

   1      Introduction to document retrieval systems, automated
          text processing, search techniques, file organization,
          and disk storage devices

   2      Survey of logical file organizations (sequential,
          chained, inverted, calculated access, clustered)
          and physical organizations (serial, direct, indexed
          sequential)

   3      Detailed description of clustered files including
          classification schemes, hierarchy structure,
          profile definition, search strategies, updating
          procedures, storage considerations, query clustering,
          and alternate uses of the cluster hierarchy
          (underlined topics are the basis of experiments
          in later chapters)

   4      Development of evaluation procedures; description
          of the experimental collections

   5      Experiments with profile definition, specifically
          examining standard and rank value profiles, search
          bias, vector length, frequency considerations, as
          well as unweighted and partially weighted vectors

   6      Experiments related to updating a clustered document
          collection (profile maintenance schemes and rate
          of hierarchy degeneration)

   7      Experiments with schemes for storing the hierarchy
          on disk in conjunction with ISAM indices; development
          of a disk storage and retrieval simulation model

   8      Experiments with automatic query alteration using
          information within the profile hierarchy

   9      Comparison of clustered and inverted files with
          respect to storage, speed, and quality of retrieved
          output

  10      Summary and conclusions

The results of this work are applicable to various types of research
in information retrieval. Within the SMART project, these efforts produced
a new document collection, the Cranfield 1400 stem. The parameters for
the collection preparation and a summary of its properties are contained
in Chapter IV and Appendix A. In addition, Chapter VI contains an algorithm
for splitting a collection into special test subcollections needed for
these and other experiments (see Appendix B).

On a higher level, some of the genuine contributions of this work
reside in its test and evaluation methods. The cluster-oriented
evaluation scheme developed in Chapter IV and used throughout is unique in
that it is independent of search strategy, accounts more accurately for
system effort (disk accesses), and costs relatively little to run. Chapter V
introduces the concept of biased search results and sets forth a procedure
for detecting bias with respect to a given profile property. The technique
can be extended to other properties and many types of searches (full,
inverted, etc.) and should be useful in many experimental setups. Chapter
VII utilises an interesting disk storage algorithm and simulation scheme
for examining questions related to search speed and space utilization.
Finally, the work in Chapter VIII includes a new way of computing similarity
matrices, namely by summing matrices for subsets of desired items.

A large number of results are directly related to the use of clustered
document files. These are explained in detail and adequately summarised
in Chapters V to VIII. Only a general summary will be given here. First,
the experiments in Chapter V show that it is possible to make a reasonably
accurate, economical cluster profile, which is free from correlation
domination and bias. Specifically, its term weights should be based on
frequency ranks and therefore be non-decreasing with the number of
occurrences. A large number of low weighted terms can be deleted with a
considerable saving in storage space. Second, the most satisfactory update
procedure in a clustered file alters the weights of only existing profile
terms. Even so, the hierarchy degenerates with the addition of new items
and should undergo at least partial re-clustering when it increases 2 ^- 50 ^ 
in size. (This percentage is figured as the ratio of additions to the
current file size.) Third, a cluster hierarchy should be stored by levels
to facilitate rapid searching. Finally, term substitutes within profiles
should not be used to automatically alter requests as described in Chapter
VIII. However, there are other ways of using the cluster
hierarchy which can help justify the expense of document classification.

These results do not suggest that no further improvements might be 
made in profile definition. For example, a large discrepancy remains 
between the best achievable performance curve (not the ideal curve) and 
those actually obtained in these experiments. It is felt that additional 
improvements can be made, perhaps by using partial weighting techniques 
or a completely different scheme altogether. Some problems undoubtedly 
are related to the document indexing. If changes are made in profiles, 
the optimal updating scheme may change also and the general hierarchy 
quality might become more sensitive to new additions. With less specula- 
tion it can be said that there is a genuine need for developing new, 
additional uses for a cluster hierarchy. A few are suggested in Chapter
III, for example, using a hierarchy in selective dissemination of
information and document browsing.

In the final analysis, this investigation attempts to answer the
question "Is a clustered file organization suitable for on-line document
retrieval?" Part of the answer is obtained from the comparison with the
inverted organization in Chapter IX. A clustered file is found to compare
favorably in terms of search speed and storage economy. Search precision
is less, but compensated by a flexible level of recall (low or high).
In general, the clustered organization provides a great deal of
flexibility, allowing any type of request-document matching, search
strategy, or feedback. Part of this is due to the fact that the entire
document remains intact in storage, rather than being split up and stored
in pieces. Thus all information is available for use by matching
coefficients, feedback schemes, etc. Furthermore, it is certain that
on-line retrieval must move away from making arbitrary accesses into a data
base for individual records. A clustered file solves this problem by
concentrating those records with a high probability of satisfying a
request in a few disk areas. Therein lies its greatest value.

Appendix A
Common Word List

The forms of the Cranfield document and query collection used in this
research are produced by removing common words from the original document
and query texts and by applying an analysis scheme to reduce variants of a
word to the same stem. The common word list (restriction list) includes
360 prepositions, pronouns, conjunctions, and verbs of the following types:

1. Prepositions (of, on, at, in, ...)

2. Pronouns

   a) Personal (he, she, they, ...)
   b) Possessive (his, hers, my, ...)
   c) Reflexive (myself, herself, ...)
   d) Interrogative (who, which, ...)
   e) Demonstrative (this, that, ...)
   f) Indefinite (all, any, both, each, many, ...)

3. Conjunctions

   a) Coordinating (and, but, or, ...)
   b) Correlative (either, whether, not only, ...)
   c) Subordinating (after, because, how, unless, ...)

4. Verbs

   a) Auxiliaries and their forms (be, do, have, can,
      may, ...)
   b) Non-content (begin, choose, make, come, give, keep,
      meet, put, say, see, show, take, ...)

5. Other

   a) Individual letters
   b) Punctuation
   c) Numbers

The suffix removal program uses the standard SMART suffix list and will not
be described here.
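
The common word removal step can be sketched in a few lines. The example below is illustrative only: the word set contains just the sample words named in this appendix rather than the full 360-entry restriction list, the function name and the tokenizing rule are assumptions, and suffix removal is deliberately left out because the SMART suffix list is not reproduced here.

```python
import re

# Illustrative subset: only the example words named in this appendix.
# The actual restriction list has 360 entries and is not reproduced here.
COMMON_WORDS = set(
    """
    of on at in he she they his hers my myself herself who which this that
    all any both each many and but or either whether after because how
    unless be do have can may begin choose make come give keep meet put say
    see show take
    """.split()
)


def remove_common_words(text):
    """Return the content words of `text` in their original order.

    Punctuation and numbers are dropped by the tokenizer, individual
    letters by the length test, and common words by the set lookup.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) > 1 and t not in COMMON_WORDS]


print(remove_common_words("He can show that lift and drag both vary in flight."))
```

A set is used for the restriction list so that each lookup takes constant time regardless of the list's length.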

Appendix B

Subcollections for the Updating Experiments

In order to conduct the updating experiments in Chapter VI, the
Cranfield documents are separated into subsets with special properties.
Initially the 1400 documents are divided into halves, sets A and B, each
containing documents chosen in such a way that half the relevant documents
for each query lie in each subset. Then, the B subcollection is split into
halves again, sets C and D, each containing one quarter of the total
relevant documents for each query. The documents in sets A, C, and D are
listed below; the algorithm by which they are derived is given in
Chapter VI.



A subcollection

  6   10   13   15   17   19   21   24   26   28
 30   32   34   37   39   41   43   44   46   48
 50   52   54   56   59   62   63   64   65   67
 69   70   72   74   76   78   80   82   84   86
 88   90   92   94   96   97   99  101  103  105
107  109  111  113  114  116  120  122  124  126
129  131  133  135  137  140  142  143  146  148
151  153  155  156  157  159  161  163  165  167
169  171  173  175  177  179  181  185  187  189
192  194

Appendix C

Confirmation Test Evaluation Curves

Section 9 of Chapter V summarizes a set of tests confirming that
profiles for various cluster hierarchies behave in approximately the same
way. That is, the preliminary conclusions drawn from the experiments on
Hierarchy 1 are supported by the experiments on Hierarchies 2 and 3. Those
preliminary conclusions are:

a) profile term weights based on frequency ranks are
superior to those based on frequency counts;

b) a large number of low weight profile terms can be
deleted without adversely affecting performance,
P*(δ = -1) ≈ P*; and

c) shortened unweighted profiles are roughly equivalent to
weighted profiles, unweighted P*(δ = -1) ≈ weighted P*(δ = -1).

In accordance with the scheme set up in Chapter IV, the confirmation tests
involve both cluster-oriented evaluation (RC-PF data from levels 1 and 2)
as well as SMART evaluation (P-R data from narrow and broad searches). In
each of these four trials, an attempt is made to fairly judge the relative
merit of four profile types in the three hierarchies. Thus, there are a
total of 12 plots of performance, each showing 4 curves. A consistent
notation is used in these plots which is explained in Table C-1 along with
some properties of each hierarchy. It is worth remarking that the
absolute values of the performance measures differ substantially among
the collections. However, the primary concern here is with the relative
positions of performance curves and not the actual measured values; hence
different scales are of secondary importance.

First, consider the RC-PF plot for any collection and hierarchy level
(Figures C-1, C-2, C-5, C-6, C-9, C-10). Corresponding points on the 4
curves represent the same amount of system effort, the average number of
disk accesses to expand one additional cluster (amount not shown), and
the placement of each point indicates the resulting recall ceiling and
precision floor of the search. Therefore, the curves can be compared on a
point-to-point basis and profiles ranked accordingly. Since the curves do
not usually overlap, the judgments are made with considerable confidence.
Ranks are listed beneath each figure; a composite ranking is shown in Table
V-5. As in all comparisons there must be a criterion for determining when
curves are significantly different. Here, a 2%-4% difference is considered
significant, this amount being about half that used in some SMART
experiments. However, 4 times as many queries are used in these tests, so
the confidence level for the conclusions remains about the same in both
cases.

Accurate judgments are a bit more difficult to make using the SMART
precision-recall curves and normalized measures (Figures C-3, C-4, C-7,
C-8, C-11, C-12). The basic problem is that of comparing searches
involving the same amount of work. Unfortunately, it is impossible to
ascertain the number of disk accesses per search. The number of profile and
document correlations is available instead and is shown beneath each
figure. Small differences in the number of document correlations can be
neglected since entire clusters of items are fetched at a time. Differences
in the number of profile correlations are more serious since these records
are obtained from more or less arbitrary disk locations (1 access per node).
Consequently, an equal number of profile correlations is more important in
determining expenditure of "equal system effort." Obviously, the most
desirable profile provides a superior precision-recall curve for the
smallest amount of work. As a practical matter in these comparisons, it is
necessary to decide not only what performance difference is significant,
but also when superior performance (P-R) must be downgraded because of
excess search effort. Here a 2% difference in normalized measures or a W
difference in P-R curves is considered significant and is offset only by
one or fewer profile correlations in the next lower ranking search.

    Symbol    Description

    o         Denotes curves for P profiles: term weights are
              proportional to frequency counts
    o         Denotes curves for P* profiles: term weights are
              proportional to frequency ranks
    □         Denotes curves for P*(δ = -1) profiles: P*, but
              includes deletion of terms with low weights
    △         Denotes curves for unweighted P*(δ = -1) profiles:
              unweighted vectors based on P*
    NR        normalized recall
    NP        normalized precision
    C(x)      average number of correlations on hierarchy level x
    RANK      relative evaluation rank of profile types taking
              into account performance and search effort

a) Notation Used Throughout Confirmation Tests

                          Number of   Average Length    Average Length
    Hierarchy    Level    Profiles    Before Deletion   After Deletion
                                                        (δ = -1)

        1          1         13            812            141 (17%)
        1          2         55            323             70 (22%)
        2          1          6            908            207 (23%)
        2          2         94            311             69 (22%)
        3          1         28            526            103 (20%)
        3          2        103            197             47 (24%)

b) Selected Properties of Profiles

Table C-1

The following evaluation curves are presented using these methods
for determining the relative merit of each profile type in the indicated
collection. A summary of the rankings and a discussion of the test
conclusions are given in Section V.9.

    Symbol    Profile Type               Rank

    o         P                          3
    o         P*                         1
    □         P*(δ = -1)                 2
    △         P*(δ = -1), unweighted     4

Confirmation Test Results - Cluster-oriented Evaluation
Hierarchy 1, Level 1

Figure C-1

Confirmation Test Results - Cluster-oriented Evaluation
Hierarchy 1, Level 2

Figure C-2

Confirmation Test Results - SMART Evaluation
Hierarchy 1, Narrow Search

Figure C-3

Confirmation Test Results - SMART Evaluation
Hierarchy 1, Broad Search

Figure C-4

Confirmation Test Results - Cluster-oriented Evaluation
Hierarchy 2, Level 1

Figure C-5

Confirmation Test Results - Cluster-oriented Evaluation
Hierarchy 2, Level 2

Figure C-6

Confirmation Test Results - SMART Evaluation
Hierarchy 2, Narrow Search

Figure C-7

Confirmation Test Results - SMART Evaluation
Hierarchy 2, Broad Search

Figure C-8

Confirmation Test Results - Cluster-oriented Evaluation
Hierarchy 3, Level 1

Figure C-9

Confirmation Test Results - Cluster-oriented Evaluation
Hierarchy 3, Level 2

Figure C-10

Confirmation Test Results - SMART Evaluation
Hierarchy 3, Narrow Search

Figure C-11

Confirmation Test Results - SMART Evaluation
Hierarchy 3, Broad Search

Figure C-12